(57) ABSTRACT

The invention provides methods and apparatus for multiple 
user detection (MUD) processing that have application, for 
example, in improving the capacity of CDMA and other wireless
base stations. One aspect of the invention provides a
multiprocessor, multiuser detection system for detecting 
user transmitted symbols in CDMA short-code spectrum 
waveforms. A first processing element generates a matrix 
(hereinafter, "gamma matrix") that represents a correlation 
between a short-code associated with one user and those 
associated with one or more other users. A set of second 
processing elements generates, e.g., from the gamma matrix, 
a matrix (hereinafter, "R-matrix") that represents cross- 
correlations among user waveforms based on their ampli- 
tudes and time lags. A third processing element produces 
estimates of the user transmitted symbols as a function of the 
R-matrix. 
WIRELESS COMMUNICATIONS SYSTEMS AND
METHODS FOR DIRECT MEMORY ACCESS AND 
BUFFERING OF DIGITAL SIGNALS FOR 
MULTIPLE USER DETECTION 

BACKGROUND OF THE INVENTION 

[0001] This application claims the benefit of priority of (i) 
U.S. Provisional Application Serial No. 60/275,846 filed 
Mar. 14, 2001, entitled "Improved Wireless Communica- 
tions Systems and Methods"; (ii) U.S. Provisional Applica- 
tion Ser. No. 60/289,600 filed May 7, 2001, entitled 
"Improved Wireless Communications Systems and Methods 
Using Long-Code Multi-User Detection" and (iii) U.S. 
Provisional Application Ser. No. 60/295,060 filed Jun. 1, 
2001 entitled "Improved Wireless Communications Systems 
and Methods for a Communications Computer," the teachings
of all of which are incorporated herein by reference.

[0002] The invention pertains to wireless communications 
and, more particularly, by way of example, to methods and 
apparatus providing multiple user detection for use in code 
division multiple access (CDMA) communications. The 
invention has application, by way of non-limiting example, 
in improving the capacity of cellular phone base stations. 

[0003] Code-division multiple access (CDMA) is used 
increasingly in wireless communications. It is a form of 
multiplexing communications, e.g., between cellular phones 
and base stations, based on distinct digital codes in the 
communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple 
access and time-division multiple access, in which multi- 
plexing is based on the use of orthogonal frequency bands 
and orthogonal time-slots, respectively. 

[0004] A limiting factor in CDMA communication and,
particularly, in so-called direct sequence CDMA (DS-CDMA),
is interference, both that wrought on individual
transmissions by buildings and other "environmental" factors,
as well as that between multiple simultaneous communications,
e.g., multiple cellular phone users in the same
geographic area using their phones at the same time. The
latter is referred to as multiple access interference (MAI).
Along with environmental interference, it has the effect of
limiting the capacity of cellular phone base stations, driving 
service quality below acceptable levels when there are too 
many users. 

[0005] A technique known as multi-user detection (MUD) 
is intended to reduce multiple access interference and, as a
consequence, increase base station capacity. It can reduce
interference not only between multiple transmissions of like 
strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users 
(the so-called near/far problem). MUD generally functions 
on the principle that signals from multiple simultaneous 
users can be jointly used to improve detection of the signal 
from any single user. Many forms of MUD are discussed in 
the literature; surveys are provided in Moshavi, "Multi-User 
Detection for DS-CDMA Systems," IEEE Communications 
Magazine (October, 1996) and Duel-Hallen et al., "Multiuser
Detection for CDMA Systems," IEEE Personal Communications
(April 1995). Though a promising solution for
increasing the capacity of cellular phone base stations,
MUD techniques are typically so computationally intensive
as to limit practical application.



[0006] An object of this invention is to provide improved 
methods and apparatus for wireless communications. A 
related object is to provide such methods and apparatus for 
multi-user detection or interference cancellation in code- 
division multiple access communications. 

[0007] A further related object is to provide such methods 
and apparatus as provide improved short-code and/or long- 
code CDMA communications. 

[0008] A further object of the invention is to provide such
methods and apparatus as can be cost-effectively implemented
and as require minimal changes in existing wireless
communications infrastructure.

[0009] A still further object of the invention is to provide 
methods and apparatus for executing multi-user detection 
and related algorithms in real-time. 

[0010] A still further object of the invention is to provide 
such methods and apparatus as manage faults for high- 
availability. 

SUMMARY OF THE INVENTION 

[0011] The foregoing and other objects are among those 
attained by the invention which provides methods and 
apparatus for multiple user detection (MUD) processing. 
These have application, for example, in improving the
capacity of CDMA and other wireless base stations.

Wireless Communications Systems and Methods 
for Multiple Processor Based Multiple User 
Detection 

[0012] One aspect of the invention provides a multiuser 
communications device for detecting user transmitted sym- 
bols in CDMA short-code spread spectrum waveforms. A 
first processing element generates a matrix (hereinafter,
"gamma matrix") that represents a correlation between a
short-code associated with one user and those associated 
with one or more other users. A set of second processing 
elements generates, e.g., from the gamma matrix, a matrix 
(hereinafter, "R-matrix") that represents cross-correlations 
among user waveforms based on their amplitudes and time 
lags. A third processing element produces estimates of the 
user transmitted symbols as a function of the R-matrix. 

[0013] In related aspects, the invention provides a mul- 
tiuser communications device in which a host controller 
performs a "partitioning function," assigning to each second 
processing element within the aforementioned set a portion 
of the R-matrix to generate. This partitioning can be a 
function of the number of users and the number of process- 
ing elements available in the set. According to related 
aspects of the invention, as users are added or removed from 
the spread spectrum system, the host controller performs
further partitioning, assigning each second processing ele- 
ment within the set a new portion of the R-matrix to 
generate. 

[0014] Further related aspects of the invention provide a 
multiuser communications device as described above in 
which the host controller is coupled to the processing
elements by way of a multi-port data switch. Still further 
related aspects of the invention provide such a device in 
which the first processing element transfers the gamma- 
matrix to the set of second processing elements via a 
memory element. 
[0015] Similarly, the set of second processing elements
places the respective portions of the R-matrix in memory
accessible to the third processing element via the data 
switch. Further related aspects of the invention provide 
devices as described above in which the host controller 
effects data flow synchronization between the first process- 
ing element and the set of second processing elements, as 
well as between the set of second processing elements and 
the third processing element. 

Wireless Communications Systems and Methods 
for Contiguously Addressable Memory Enabled 
Multiple Processor Based Multiple User Detection 

[0016] Another aspect of the invention provides a mul- 
tiuser communications device for detecting user transmitted 
symbols in CDMA short-code spread spectrum waveforms 
in which a set of first processing elements generates a matrix 
(hereinafter the "R-matrix") that represents cross-correla- 
tions among user waveforms based on their amplitudes and 
time lags. The first processing elements store that matrix to 
contiguous locations of an associated memory. 

[0017] Further aspects of the invention provide a device as 
described above in which a second processing element, 
which accesses the contiguously stored R-matrix, generates 
estimates of the user transmitted symbols. 

[0018] Still further aspects of the invention provide such a 
device in which a third processing element generates a 
further matrix (hereinafter, "gamma-matrix") that represents 
a correlation between a CDMA short-code associated with 
one user and those associated with one or more other users; 
this gamma-matrix is used by the set of first processing elements
in generating the R-matrix. In related aspects, the
invention provides such a device in which the third process- 
ing element stores the gamma-matrix to contiguous loca- 
tions of a further memory. 

[0019] In other aspects, the invention provides a multiuser 
device as described above in which a host controller per- 
forms a "partitioning function" of the type described above 
that assigns to each processing element within the set a
portion of the R-matrix to generate. Still further aspects 
provide such a device in which the host controller is coupled 
to the processing elements by way of a multi-port data 
switch. 

[0020] Other aspects of the invention provide such a 
device in which the third processing element transfers the 
gamma-matrix to the set of first processing elements via a 
memory element. 

[0021] Further aspects of the invention provide a multiuser 
communications device as described above with a direct 
memory access (DMA) engine that places elements of the 
R-matrix into the aforementioned contiguous memory loca- 
tions. 

[0022] Further aspects of the invention provide methods 
for operating a multiuser communications device paralleling 
the operations described above. 

Wireless Communications Systems and Methods 
for Cache Enabled Multiple Processor Based 
Multiple User Detection 

[0023] Other aspects of the invention provide a multiuser 
communications device that makes novel use of cache and
random access memory for detecting user transmitted symbols
in CDMA short-code spectrum waveforms. According
to one such aspect, there is provided a processing element 
having a cache memory and a random access memory. A 
host controller places in the cache memory data represen- 
tative of characteristics of the user waveforms. The process- 
ing element generates a matrix as a function of the data 
stored in the cache, and stores the matrix in either the cache 
or the random access memory. 

[0024] Further aspects of the invention provide a device as 
described above in which the host controller stores in cache 
data representative of the user waveforms' short-code
sequences. The processing element generates the matrix as
a function of that data, and stores the matrix in random
access memory. 

[0025] Still further aspects of the invention provide such a 
device in which the host controller stores in cache data 
representative of a correlation of time-lags between the user 
waveforms and data representative of a correlation of com- 
plex amplitudes of the user waveforms. The host controller 
further stores in random access memory data representing a
correlation of short-code sequences for the user waveforms.
The processing element generates the matrix as a function of 
the data and stores that matrix in RAM. 

[0026] Further aspects of the invention provide a device as
described above in which a host controller stores in cache an
attribute representative of a user waveform, and stores in
random access memory an attribute representing a
cross-correlation among user waveforms based on time-lags and
complex amplitudes. The processing element generates esti- 
mates of user transmitted symbols and stores those symbols 
in random access memory. 

[0027] Other aspects of the invention provide such a 
device in which the host controller transmits the matrix 
stored in the cache or random access memory of a process- 
ing element to the cache or random access memory of a 
further processing element. 

[0028] Further aspects of the invention provide a multiuser 
communications device as described above with a multi-port 
data switch coupled to a short-code waveform receiver 
system and also coupled to a host controller. The host 
controller routes data generated by the receiver system to the 
processing element via the data switch. 

[0029] Further aspects of the invention provide methods 
for operating a multiuser communications device paralleling 
the operations described above. 

Wireless Communications Systems and Methods 
for Nonvolatile Storage Of Operating Parameters 
For Multiple Processor Based Multiple User 
Detection 

[0030] Another aspect of the invention provides a mul- 
tiuser communications device for detecting user transmitted 
symbols in CDMA short-code spectrum waveforms in which
fault and configuration information is stored to a nonvolatile 
memory. A processing element, e.g. that performs symbol 
detection, is coupled with random access and nonvolatile 
memories. A fault monitor periodically polls the processing 
element to determine its operational status. If the processing 
element is non-operational, the fault monitor stores information
including configuration and fault records, as well as at
least a portion of data from the processing element's RAM,
into the nonvolatile memory.

[0031] According to further aspects of the
invention, following detection of the non-operational status,
the fault monitor sends to a host controller a reset-request 
interrupt together with the information stored in the non- 
volatile RAM. In turn, the host controller selectively issues 
a reset command to the processing element. In related 
aspects, the processing element resets in response to the 
reset command and transfers (or copies) the data from the 
nonvolatile memory into the RAM, and therefrom continues 
processing the data in the normal course. 

[0032] Further aspects of the invention provide a device as 
described above in which the processing element periodi- 
cally signals the fault monitor and, in response, the fault 
monitor polls the processing element. If the fault monitor 
does not receive such signaling within a specified time 
period, it sets the operational status of the processing ele- 
ment to non-operational. 

[0033] According to a related aspect of the invention, the 
fault monitor places the processing elements in a non- 
operational status while performing a reset. The fault moni- 
tor waits a time period to allow for normal resetting and 
subsequently polls the processor to determine its operational 
status. 

[0034] Still further aspects of the invention provide a 
device as described above in which there are a plurality of 
processing elements, each with a respective fault monitor. 

[0035] Yet still further related aspects of the invention
provide for the fault monitor monitoring a data bus coupled with the
processing element.

[0036] Further aspects of the invention provide methods 
for operating a multiuser communications device paralleling 
the operations described above. 

Wireless Communications Systems and Methods 
for Multiple Operating System Multiple User 
Detection 

[0037] Another aspect of the invention provides a mul- 
tiuser communications device for detecting user transmitted 
symbols in CDMA short-code spectrum waveforms in which 
a first process operating under a first operating system 
executes a first set of communication tasks for detecting the 
user transmitted symbols and a second process operating
under a second operating system, one that differs from the first
operating system, executes a second set of tasks for like
purpose. A protocol translator translates communications
between the processes. According to one aspect of the 
invention, the first process generates instructions that deter- 
mine how the translator performs such translation. 

[0038] According to another aspect of the invention, the 
first process sends a set of instructions to the second process 
via the protocol translator. Those instructions define the set 
of tasks executed by the second process. 

[0039] In a related aspect of the invention, the first process 
sends to the second process instructions for generating a 
matrix. That can be, for example, a matrix representing any 
of a correlation of short-code sequences for the user waveforms,
a cross-correlation of the user waveforms based on
time-lags and complex amplitudes, and estimates of user
transmitted symbols embedded in the user waveforms.

[0040] Further aspects of the invention provide a device as 
described above in which the first process configures the 
second process, e.g., via data sent through the protocol 
translator. This can include, for example, sending a configu- 
ration map that defines where a matrix (or portion thereof) 
generated by the second process is stored or otherwise 
directed. 

[0041] Still further aspects of the invention provide a 
device as described above in which the first process is 
coupled to a plurality of second processes via the protocol 
translator. Each of the latter processes can be configured and 
programmed by the first process to generate a respective 
portion of a common matrix, e.g., of the type described 
above. Further aspects of the invention provide methods for 
operating a multiuser communications device paralleling the 
operations described above. 

Wireless Communications Systems and Methods 
for Direct Memory Access and Buffering of Digital
Signals for Multiple User Detection 

[0042] Another aspect of the invention provides a mul- 
tiuser communications device for detecting user transmitted 
symbols in CDMA short-code spectrum waveforms in which
a programmable logic device (hereinafter "PLD") enables 
direct memory access of data stored in a digital signal 
processor (hereinafter "DSP"). The DSP has a memory 
coupled with a DMA controller that is programmed via a 
host port. The PLD programs the DMA controller via the 
host port to allow a buffer direct access to the memory. 

[0043] In a related aspect according to the invention, the 
PLD programs the DMA controller to provide non-frag- 
mented block mode data transfers to the buffer. From the 
buffer, the PLD moves the blocks to a data switch that is 
coupled to processing devices. In further related aspects
according to the invention, the PLD programs the DMA
controller to provide fragmented block mode data transfers
utilizing a protocol. The PLD provides the protocol, which
fragments and unfragments the blocks prior to moving them
to the data switch.

[0044] In further aspects provided by a device as described 
above, the PLD is implemented as a field programmable gate 
array that is programmed by a host controller coupled with 
the data switch. In a related aspect, the PLD is implemented 
as an application specific integrated circuit which is programmed
during manufacture. In still further aspects, a device as
described above provides for a buffer implemented as a set
of registers, or as dual-ported random access memory.

[0045] Further aspects of the invention provide methods 
for operating a multiuser communications device paralleling 
the operations described above. 

Improved Wireless Communications Systems and 
Methods for Short-code Multiple User Detection 

[0046] Still further aspects of the invention provide meth- 
ods for processing short code spread spectrum waveforms 
transmitted by one or more users including the step of 
generating a matrix indicative of cross correlations among 
the waveforms as a composition of (i) a first component that 
represents correlations among time lags and short codes 
associated with the waveforms transmitted by the users, and
(ii) a second component that represents correlations among
multipath signal amplitudes associated with the waveforms
transmitted by the users. The method further includes gen- 
erating detection statistics corresponding to the symbols as 
a function of the correlation matrix, and generating estimates 
of the symbols based on those detection statistics. 

[0047] Related aspects of the invention provide methods
as described above in which the first component is updated 
on a time scale that is commensurate with a rate of change 
of the time lags associated with the transmitted waveforms, 
and the second component is updated on a different time 
scale, i.e., one that is commensurate with a rate of change of 
the multipath amplitudes associated with these waveforms. 
In many embodiments, the updating of the second compo- 
nent, necessitated as a result of change in the multipath 
amplitudes, is executed on a shorter time scale than that of 
updating the first component. 

[0048] Other aspects of the invention provide methods as 
described above in which the first component of the cross- 
correlation matrix is generated as a composition of a first 
matrix component that is indicative of correlations among 
the short codes associated with the respective users, and a 
second matrix component that is indicative of the wave- 
forms transmitted by the users and the time lags associated 
with those waveforms.

[0049] In a related aspect, the invention provides methods 
as above in which the first matrix component is updated 
upon addition or removal of a user to the spread spectrum 
system. This first matrix component (referred to below as
the Γ-matrix) can be computed as a convolution of the short
code sequence associated with each user with the short 
codes of other users. 

[0050] According to further aspects of the invention, elements
of the Γ-matrix are computed in accord with the
relation:




[0051] wherein 

[0052] $c_l^*[n]$ represents the complex conjugate of a
short code sequence associated with the l-th user,

[0053] $c_k[n-m]$ represents a short code sequence
associated with the k-th user,

[0054] N represents a length of the short code
sequence, and

[0055] $N_l$ represents the number of non-zero elements of
the short code sequence.
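
The definitions above can be illustrated with a minimal Python sketch. It assumes that each Γ-matrix element is the sum over n of $c_l^*[n] \cdot c_k[n-m]$, scaled by $1/(2N_l)$; the scaling factor, the treatment of out-of-range chips as zero, and the function names are assumptions made for illustration and are not taken from the text.

```python
def gamma_element(c_l, c_k, m, normalize=True):
    """Correlate code c_l with code c_k shifted by m chips.

    Chips whose shifted index n - m falls outside c_k are treated as zero
    (an assumption; a cyclic shift would be another plausible reading).
    """
    acc = 0j
    for n in range(len(c_l)):
        j = n - m
        if 0 <= j < len(c_k):
            acc += c_l[n].conjugate() * c_k[j]
    if not normalize:
        return acc
    n_nonzero = sum(1 for chip in c_l if chip != 0)  # N_l in the definitions
    if n_nonzero == 0:
        return 0j
    return acc / (2 * n_nonzero)  # assumed 1/(2 N_l) scaling


def gamma_matrix(codes, max_shift):
    """Return {(l, k, m): value} for every user pair and shift in [-max_shift, max_shift]."""
    users = range(len(codes))
    return {
        (l, k, m): gamma_element(codes[l], codes[k], m)
        for l in users
        for k in users
        for m in range(-max_shift, max_shift + 1)
    }
```

Because the codes change only when a user is added or removed, a table such as the one returned by `gamma_matrix` would need recomputing only on those events.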

[0056] In further aspects, the invention provides a method 
as described above in which the first component of the 
cross-correlation matrix (referred to below as the C matrix) 
is obtained as a function of the aforementioned Γ-matrix in
accord with the relation: 



$$C_{lkqq'}[m'] = \sum_{m} g[mN_c + \tau] \cdot \Gamma_{lk}[m]$$



[0057] wherein

[0058] g is a pulse shape vector,

[0059] $N_c$ is the number of samples per chip,

[0060] $\tau$ is a time lag, and

[0061] $\Gamma$ represents the Γ-matrix, e.g., defined above.

[0062] In a related aspect, the cross-correlation matrix 
(referred to below as the R-matrix) can be generated as a 
function of the C matrix in accord with the relation: 



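$$r_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \operatorname{Re}\left\{ \hat{a}_{lq}^{*} \cdot \hat{a}_{kq'} \cdot C_{lkqq'}[m'] \right\}$$



[0063] wherein

[0064] $\hat{a}_{lq}^{*}$ is an estimate of $a_{lq}^{*}$, the complex conjugate
of one multipath amplitude component of the l-th
user,

[0065] $\hat{a}_{kq'}$ is one multipath amplitude component
associated with the k-th user, and

[0066] C denotes the C matrix, e.g., as defined above.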
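
As a hedged illustration of this step, the sketch below combines multipath amplitude estimates with a precomputed block of the C matrix to form one R-matrix element. It assumes a double sum over path pairs (q, q') with the real part taken of each term; the function name and the nested-list layout of `c_block` are hypothetical.

```python
def r_element(a_l, a_k, c_block):
    """Form one R-matrix element for a fixed (l, k, m').

    a_l, a_k : lists of L complex multipath amplitude estimates for users l and k.
    c_block  : c_block[q][qp] is the C-matrix entry for path pair (q, q').
    """
    total = 0.0
    for q, a_lq in enumerate(a_l):
        for qp, a_kqp in enumerate(a_k):
            total += (a_lq.conjugate() * a_kqp * c_block[q][qp]).real
    return total
```

Since the C matrix depends only on codes and time lags, which change slowly, only this inexpensive combination has to be repeated each time the amplitude estimates are refreshed.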

[0067] In further aspects, the invention provides methods 
as described above in which the detection statistics are 
obtained as a function of the cross-correlation matrix (e.g., 
the R-matrix) in accord with the relation: 



$$y_l[m] = r_{ll}[0]\,b_l[m] + \sum_{k} r_{lk}[-1]\,b_k[m+1] + \sum_{k} \bigl[ r_{lk}[0] - r_{ll}[0]\,\delta_{lk} \bigr]\,b_k[m] + \sum_{k} r_{lk}[1]\,b_k[m-1] + \eta_l[m]$$



[0068] wherein

[0069] $y_l[m]$ represents a detection statistic corresponding
to the m-th symbol transmitted by the l-th user,

[0070] $r_{ll}[0]\,b_l[m]$ represents a signal of interest, and

[0071] remaining terms of the relation represent Multiple
Access Interference (MAI) and noise.

[0072] In a related aspect, the invention provides methods 
as described above in which estimates of the symbols 
transmitted by the users and encoded in the short code 
spread spectrum waveforms are obtained based on the 
computed detection statistics by utilizing, for example, a 
multi-stage decision-feedback interference cancellation 
(MDFIC) method. Such a method can provide estimates of 
the symbols, for example, in accord with the relation: 
$$\hat{b}_l[m] = \operatorname{sign}\Bigl\{ y_l[m] - \sum_{k} r_{lk}[-1]\,\hat{b}_k[m+1] - \sum_{k} \bigl[ r_{lk}[0] - r_{ll}[0]\,\delta_{lk} \bigr]\,\hat{b}_k[m] - \sum_{k} r_{lk}[1]\,\hat{b}_k[m-1] \Bigr\}$$



[0073] wherein

[0074] $\hat{b}_l[m]$ represents an estimate of the m-th symbol
transmitted by the l-th user.
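
The decision-feedback idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: symbols are taken to be binary (+1 or -1), the first stage uses plain hard decisions on the detection statistics, and the names `mdfic`, `r_prev`, `r_zero` and `r_next` (holding the R-matrix at lags -1, 0 and 1) are illustrative.

```python
def sign(x):
    return 1 if x >= 0 else -1


def mdfic(y, r_prev, r_zero, r_next, stages=2):
    """Multi-stage decision-feedback interference cancellation (sketch).

    y[l][m]      : detection statistic for symbol m of user l
    r_prev[l][k] : R-matrix element at lag -1
    r_zero[l][k] : R-matrix element at lag 0
    r_next[l][k] : R-matrix element at lag +1
    """
    users, symbols = len(y), len(y[0])
    # Stage 0: conventional hard decisions, ignoring interference.
    b = [[sign(y[l][m]) for m in range(symbols)] for l in range(users)]
    for _ in range(stages):
        for l in range(users):
            for m in range(symbols):
                mai = 0.0
                for k in range(users):
                    if m + 1 < symbols:
                        mai += r_prev[l][k] * b[k][m + 1]
                    if k != l:  # the delta term removes the signal of interest
                        mai += r_zero[l][k] * b[k][m]
                    if m >= 1:
                        mai += r_next[l][k] * b[k][m - 1]
                # Updating in place feeds the newest decisions back immediately.
                b[l][m] = sign(y[l][m] - mai)
    return b
```

In a near/far case where a strong user masks a weak one, the first-stage decision for the weak user can be wrong; subtracting the strong user's estimated contribution in a later stage can correct it.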

[0075] Further aspects of the invention provide logic car- 
rying out operations paralleling the methods described 
above. 

Load Balancing Computational Methods in a 
Short-code Spread-spectrum Communications 
System 

[0076] In further aspects, the invention provides methods 
for computing the cross-correlation matrix described above 
by distributing among a plurality of logic units parallel 
tasks — each for computing a portion of the matrix. The 
distribution of tasks is preferably accomplished by partition- 
ing the computation of the matrix such that the computa- 
tional load is distributed substantially equally among the 
logic units. 

[0077] In a related aspect, a metric is defined for each 
partition in accord with the relation below. The metric is 
utilized as a measure of the computational load associated 
with each logic unit to ensure that the computational load is 
distributed substantially equally among the logic units: 

[0078] wherein 

[0079] Ai represents an area of a portion of the 
cross-correlation matrix corresponding to the par- 
tition, and 

[0080] i represents an index corresponding to the 
number of logic units over which the computation is 
distributed. 

[0081] In another aspect, the invention provides methods 
as described above in which the cross-correlation matrix is 
represented as a composition of a rectangular component 
and a triangular component. Each area, represented by Ai in
the relation above, includes a first portion corresponding to 
the rectangular component and a second portion correspond- 
ing to the triangular component. 
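
One way to picture such a partition is sketched below. The load metric itself is not reproduced in this text, so the sketch makes its own assumption: the work assigned to a logic unit is taken to be proportional to the number of matrix elements it computes. For simplicity it treats only a lower-triangular region split into contiguous bands of rows; the function names are hypothetical.

```python
def partition_rows(num_users, num_units):
    """Split rows of a lower-triangular matrix into contiguous bands of near-equal area.

    Returns a list of (first_row, end_row) pairs, end_row exclusive.
    """
    # cumulative[r] = number of elements in rows 0 .. r-1 of the triangle
    cumulative = [r * (r + 1) // 2 for r in range(num_users + 1)]
    total = cumulative[-1]
    cuts = [0]
    for unit in range(1, num_units):
        target = total * unit / num_units
        # Pick the row boundary whose cumulative area is closest to the target.
        cut = min(range(cuts[-1], num_users + 1),
                  key=lambda r: abs(cumulative[r] - target))
        cuts.append(cut)
    cuts.append(num_users)
    return list(zip(cuts[:-1], cuts[1:]))


def band_area(band):
    """Number of triangular-matrix elements contained in a band of rows."""
    first, end = band
    return end * (end + 1) // 2 - first * (first + 1) // 2
```

Later rows of a triangle hold more elements, so bands near the bottom span fewer rows. Re-running the partition when users are added or removed corresponds to the repartitioning described earlier.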

[0082] Further aspects of the invention provide logic car- 
rying out operations paralleling the methods described 
above. 

Hardware and Software for Performing 
Computations in a Short-code Spread-spectrum 
Communications System 

[0083] In other aspects, the invention provides an apparatus
for efficiently computing a Γ-matrix as described
above, e.g., in hardware. The system includes two registers,
one associated with each of the l-th and k-th users. The registers
hold elements of the short code sequences associated with
the respective user such that alignment of the short code
sequence loaded in one register can be shifted relative to that
of the other register by m elements. Associated with each of
the foregoing registers is one additional register storing
mask sequences. Each element in those sequences is zero if
a corresponding element of the short code sequence of the
associated register is zero and, otherwise, is non-zero. The
mask sequences loaded in these further registers are shifted
relative to each other by m elements. A logic unit performs an
arithmetic operation on the short code and mask sequences
to generate, for the m-th transmitted symbol, the (l, k) element of
the Γ-matrix, i.e., $\Gamma_{lk}[m]$.

[0084] In a related aspect, the invention provides an apparatus
as described above in which the arithmetic operation
performed by the logic unit includes, for any two aligned
elements of the short code sequences of the l-th and k-th users
and the corresponding elements of the mask sequences, (i)
an XOR operation between the short code elements, (ii) an
AND operation between the mask elements, (iii) an AND
operation between results of the step (i) and step (ii). The
result of step (iii) is a multiplier for the aligned elements,
which the logic unit sums in order to generate the (l, k) element
of the Γ-matrix.
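
The three logical steps can be mimicked in software as follows. This is a minimal sketch that assumes real-valued codes in which bit 0 encodes a +1 chip and bit 1 encodes a -1 chip, with a mask bit of 0 marking an absent (zero) chip; that encoding, the function names, and the final conversion to a signed correlation are assumptions added for illustration.

```python
def masked_counts(code_l, mask_l, code_k, mask_k, m):
    """Apply steps (i)-(iii) to every aligned pair of chips.

    Returns (valid, differing): the number of positions where both chips are
    present, and the number of those positions where the chips differ.
    """
    valid = 0
    differing = 0
    for n in range(len(code_l)):
        j = n - m
        if not 0 <= j < len(code_k):
            continue  # chip has been shifted out of the register
        x = code_l[n] ^ code_k[j]      # step (i): XOR of the code bits
        both = mask_l[n] & mask_k[j]   # step (ii): AND of the mask bits
        differing += x & both          # step (iii): AND of the two results
        valid += both
    return valid, differing


def to_correlation(valid, differing):
    """Signed correlation: agreeing chips add +1, differing chips add -1."""
    return valid - 2 * differing
```

For example, chips (+1, -1, +1, +1) against (+1, +1, -1, 0) at zero shift give three valid positions, two of which differ, for a correlation of -1.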

[0085] Further aspects of the invention provide methods 
paralleling the operations described above. 

Improved Computational Methods for Use in a 
Short-code Spread-spectrum Communications 
System 

[0086] In still further aspects, the invention provides 
improved computational methods for calculating the afore- 
said cross-correlation matrix by utilizing a symmetry prop- 
erty. Methods according to this aspect include computing a 
first one of two matrices that are related by a symmetry 
property, and calculating a second one of the two matrices 
as a function of the first component through application of 
the symmetry property. 

[0087] According to related aspects of the invention, the 
symmetry property is defined in accord with the relation: 

[0088] wherein 

[0089] $R_{lk}(m)$ and $R_{kl}(m)$ refer to the (l, k) and (k, l)
elements of the cross-correlation matrix, respectively.
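
The specific symmetry relation is not legible in this text. As a hedged illustration of the general idea, the sketch below uses a standard identity for an unnormalized cross-correlation: the (k, l) element at lag -m equals the complex conjugate of the (l, k) element at lag m, so roughly half of the elements can be derived rather than recomputed. The function names are hypothetical, and a normalized matrix may carry an additional scale factor.

```python
def xcorr(c_l, c_k, m):
    """Unnormalized correlation: sum over n of conj(c_l[n]) * c_k[n - m]."""
    return sum(c_l[n].conjugate() * c_k[n - m]
               for n in range(len(c_l))
               if 0 <= n - m < len(c_k))


def mirror(value_lk_m):
    """Derive the (k, l) element at lag -m from the (l, k) element at lag m."""
    return value_lk_m.conjugate()
```

A caller would compute `xcorr` only for pairs with l <= k and fill in the remaining entries with `mirror`.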

[0090] Further aspects of the invention provide methods as 
described above in which calculation of the cross-correlation 
matrix further includes determining a C matrix that repre- 
sents correlations among time lags and short codes associ- 
ated with the waveforms transmitted by the users, and an 
R-matrix that represents correlations among multipath sig- 
nal amplitudes associated with the waveforms transmitted 
by the users. In related aspects the step of determining the 
C matrix includes generating a first of two C-matrix com- 
ponents related by a symmetry property. A second of the 
components is then generated by applying the symmetry 
property. 

[0091] Related aspects of the invention provide a method 
as described above including the step of generating the 
Γ-matrix in accord with the relation:
[0092] wherein

[0093] $c_l^*[n]$ represents the complex conjugate of the
short code sequence associated with the l-th user,

[0094] $c_k[n-m]$ represents the short code sequence
associated with the k-th user,

[0095] N represents the length of the code, and

[0096] $N_l$ represents the number of non-zero elements of
the code.

[0097] Further aspects of the invention provide logic car- 
rying out operations paralleling the methods described 
above. 

Wireless Communications Systems and Methods 
for Virtual User Based Multiple User Detection 
Utilizing Vector Processor Generated Mapped 
Cross-correlation Matrices 

[0098] Still further aspects of the invention provide meth- 
ods for detecting symbols encoded in physical user wave- 
forms, e.g., those attributable to cellular phones, modems 
and other CDMA signal sources, by decomposing each of 
those waveforms into one or more respective virtual user 
waveforms. Each waveform of this latter type represents at 
least a portion of a symbol encoded in the respective 
physical user waveforms and, for example, can be deemed 
to "transmit" a single bit per symbol period. Methods 
according to this aspect of the invention determine cross- 
correlations among the virtual user waveforms as a function 
of one or more characteristics of the respective physical user
waveforms. From those cross-correlations, the methods gen- 
erate estimates of the symbols encoded in the physical user 
waveforms. 

[0099] Related aspects of the invention provide methods 
as described above in which a physical user waveform is
decomposed into a virtual user waveform that represents one 
or more respective control or data bits of a symbol encoded 
in the respective physical user waveform. 

[0100] Other related aspects provide for generating the 
cross-correlations in the form of a first matrix, e.g., an 
R-matrix for the virtual user waveforms. That matrix can, 
according to still further related aspects of the invention, be 
used to generate a second matrix representing cross-corre- 
lations of the physical user waveforms. This second matrix 
is generated, in part, as a function of a vector indicating the 
mapping of virtual user waveforms to physical user wave- 
forms. 

[0101] Further aspects of the invention provide a system 
for detecting symbols encoded in physical user waveforms 
that has multiple processors, e.g., each with an associated 
vector processor, that operates in accord with the foregoing 
methods to generate estimates of the symbols encoded in the 
physical user waveforms. 

[0102] Still other aspects of the invention provide a system 
for detecting user transmitted symbols encoded in short-code
spread spectrum waveforms that generates cross-correlations
among the waveforms as a function of block-floating integer
representations of one or more characteristics of those
waveforms. Such a system, according to related aspects of the
invention, utilizes a central processing unit to convert
floating-point representations of virtual user waveform
characteristics into block-floating integer representations.
A vector processor, according to
further related aspects, generates the cross-correlations from 
the latter representations. The central processing unit can 
"reformat" the resulting block-floating point matrix into 
floating-point format, e.g., for use in generating symbol 
estimates. 
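
The block-floating round trip described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the illustrated embodiment: the function names, the 15-bit mantissa width and the use of real (rather than complex) values are all hypothetical choices made for the example.

```python
import math

def to_block_floating(values, mantissa_bits=15):
    """Quantize floats to integers that share one block exponent.

    Returns (mantissas, exponent) with value ~= mantissa * 2**exponent.
    The mantissa width is an assumption chosen for illustration.
    """
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    # frexp returns peak = f * 2**e with 0.5 <= f < 1, so peak < 2**e.
    _, e = math.frexp(peak)
    exponent = e - mantissa_bits
    limit = (1 << mantissa_bits) - 1
    scale = 2.0 ** exponent
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, exponent

def integer_correlation(a, b):
    """Integer multiply-accumulate, standing in for the vector processor."""
    return sum(x * y for x, y in zip(a, b))

def from_block_floating(mantissa, exponent):
    """Convert an integer result back to floating point."""
    return math.ldexp(float(mantissa), exponent)

# Usage: correlate two short real-valued sequences via integer arithmetic.
a_int, a_exp = to_block_floating([0.5, -0.25, 0.125])
b_int, b_exp = to_block_floating([1.0, 2.0, -4.0])
corr = from_block_floating(integer_correlation(a_int, b_int), a_exp + b_exp)
```

Because every element of a block shares a single exponent, the inner loop uses only integer operands; the exponents are added once when the result is converted back to floating point.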

[0103] Still further aspects of the invention provide meth- 
ods and apparatus employing any and all combinations of 
the foregoing. These and other aspects of the invention, 
which include combinations of the foregoing, are evident in
the illustrations and in the text that follows. 

BRIEF DESCRIPTION OF THE ILLUSTRATED 
EMBODIMENT 

[0104] A more complete understanding of the invention 
may be attained by reference to the drawings, in which: 

[0105] FIG. 1 is a block diagram of components of a 
wireless base-station utilizing a multi-user detection appa- 
ratus according to the invention; 

[0106] FIG. 2 is a block diagram of components of a 
multiple user detection processing card according to the 
invention; 

[0107] FIG. 3 is a more detailed view of the processing 
board of FIG. 2; 

[0108] FIG. 4 depicts a majority-voter sub-system in a
system according to the invention; 

[0109] FIG. 5 is a block diagram of an integrated direct 
memory access (DMA) engine of the type used in a system 
according to the invention; 

[0110] FIGS. 6 and 7 depict power on/off curves for the 
processor board in a system according to the invention; 

[0111] FIG. 8 is an operational overview of functionality
within the host processor and multiple compute nodes in a 
system according to the invention; 

[0112] FIG. 9 is a block diagram of an external digital 
signal processor apparatus used to supply digital signals to 
the processor board in a system according to the invention; 

[0113] FIG. 10 illustrates an example of loading the R 
matrices on multiple compute nodes in a system according 
to the invention; 

[0114] FIG. 11 depicts a short-code loading implementa- 
tion with parallel processing of the matrices in a system 
according to the invention; 

[0115] FIG. 12 depicts a long-code loading implementa- 
tion utilizing pipelined processing and a triple-iteration of 
refinement in a system according to the invention; 

[0116] FIG. 13 illustrates skewing of multiple user wave- 
forms; 

[0117] FIG. 14 is a graph illustrating MUD efficiency as 
a function of user velocity in units of km/hr;

[0118] FIG. 15 schematically illustrates a method for
defining a common interval for three short-code streams
utilized in a FFT calculation of the Γ-matrix;

[0119] FIG. 16 schematically illustrates the Γ-matrix elements
calculated upon addition of a new physical user to a
system according to the invention; 

[0120] FIGS. 17, 18 and 19 depict hardware calculation of 
the r-matrix in a system according to the invention; 

[0121] FIG. 20 illustrates parallel computation of the R 
and C matrices in a system according to the invention; 

[0122] FIG. 21 depicts a use of a vector processor using 
integer operands for generating a cross-correlation matrix of 
virtual user waveforms in a system according to the inven- 
tion. 

DETAILED DESCRIPTION OF THE 
ILLUSTRATED EMBODIMENT 

[0123] Code-division multiple access (CDMA) wave- 
forms or signals transmitted, e.g., from a user cellular phone, 
modem or other CDMA signal source, can become distorted 
by, and undergo amplitude fades and phase shifts due to 
phenomena such as scattering, diffraction and/or reflection 
off buildings and other natural and man-made structures. 
This includes CDMA, DS/CDMA, IS-95 CDMA, CDMAOne,
CDMA2000 1X, CDMA2000 1xEV-DO, WCDMA (or UMTS),
and other forms of CDMA, which are collectively
referred to hereinafter as CDMA or WCDMA. Often the
user or other source (collectively, "user") is also moving,
e.g., in a car or train, adding to the resulting signal distortion
by alternately increasing and decreasing the distances to and
numbers of buildings, structures and other distorting factors
between the user and the base station. 

[0124] In general, because each user signal can be dis- 
torted several different ways en route to the base station or 
other receiver (hereinafter, collectively, "base station"), the 
signal may be received in several components, each with a 
different time lag or phase shift. To maximize detection of a 
given user signal across multiple time lags, a rake receiver is
utilized. Such a receiver is coupled to one or more RF 
antennas (which serve as a collection point(s) for the time- 
lagged components) and includes multiple fingers, each 
designed to detect a different multipath component of the 
user signal. By combining the components, e.g., in power or 
amplitude, the receiver permits the original waveform to be 
discerned more readily, e.g., by downstream elements in the 
base station and/or communications path. 
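
The combining step performed by a rake receiver can be sketched as follows. This is a simplified illustration, assuming a real-valued spreading code of +1/-1 chips, lags expressed in whole chips and amplitude weighting of each finger; the function name and argument layout are hypothetical and are not taken from the illustrated embodiment.

```python
def rake_combine(received, code, fingers):
    """Despread and combine multipath components of one user's signal.

    received: list of chip-rate samples (real or complex)
    code:     the user's spreading code, one symbol period long (+1/-1 chips)
    fingers:  list of (lag_in_chips, amplitude) pairs, one per multipath component
    Returns one combined statistic per symbol period.
    """
    n_chips = len(code)
    max_lag = max(lag for lag, _ in fingers)
    n_symbols = (len(received) - max_lag) // n_chips
    combined = []
    for m in range(n_symbols):
        total = 0j
        for lag, amplitude in fingers:
            start = m * n_chips + lag
            # Each finger correlates the lag-aligned samples against the code.
            finger = sum(received[start + i] * code[i] for i in range(n_chips))
            # Weight by the conjugate amplitude so stronger paths count for more.
            total += complex(amplitude).conjugate() * finger
        combined.append(total)
    return combined

# Usage: symbols (+1, -1) spread by a 4-chip code and received over two paths,
# one direct (amplitude 1.0) and one delayed by 2 chips (amplitude 0.5).
code = [1, -1, 1, 1]
received = [1, -1, 1.5, 0.5, -0.5, 1.5, -1.5, -0.5, -0.5, -0.5]
stats = rake_combine(received, code, [(0, 1.0), (2, 0.5)])
```

The sign of each combined statistic recovers the transmitted symbol, and the delayed path adds to the result rather than acting purely as interference.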

[0125] A base station must typically handle multiple user 
signals, and detect and differentiate among signals received 
from multiple simultaneous users, e.g., multiple cell phone 
users in the vicinity of the base station. Detection is typically 
accomplished through use of multiple rake receivers, one 
dedicated to each user. This strategy is referred to as single 
user detection (SUD). Alternately, one larger receiver can be 
assigned to demodulate the totality of users jointly. This 
strategy is referred to as multiple user detection (MUD). 
Multiple user detection can be accomplished through vari- 
ous techniques which aim to discern the individual user 
signals and to reduce signal outage probability or bit-error 
rates (BER) to acceptable levels. 

[0126] However, the process has heretofore been limited 
due to computational complexities which can increase exponentially
with respect to the number of simultaneous users.
Described below are embodiments that overcome this, pro- 
viding, for example, methods for multiple user detection 
wherein the computational complexity is linear with respect 
to the number of users and providing, by way of further 
example, apparatus for implementing those and other meth- 
ods that improve the throughput of CDMA and other spread- 
spectrum receivers. The illustrated embodiments are imple- 
mented in connection with short-code CDMA transmitting 
and receiver apparatus; however those skilled in the art will 
appreciate that the methods and apparatus therein may be 
used in connection with long-code and other CDMA signal- 
ling protocols and receiving apparatus, as well as with other 
spread spectrum signalling protocols and receiving appara- 
tus. In these regards and as used herein, the terms long-code 
and short-code are used in their conventional sense: the 
former referring to codes that exceed one symbol period; the 
latter, to codes that are a single symbol period or less. 

[0127] FIG. 1 depicts components of a wireless base 
station 100 of the type in which the invention is practiced. 
The base station 100 includes an antenna array 114, radio 
frequency/intermediate frequency (RF/IF) analog-to-digital 
converter (ADC), multi-antenna receivers 110, rake modems 
112, MUD processing logic 118 and symbol rate processing 
logic 120, coupled as shown. 

[0128] Antenna array 114 and receivers 110 are conventional
devices of the type used in wireless base stations
to receive wideband CDMA (hereinafter "WCDMA") trans- 
missions from multiple simultaneous users (here, identified 
by numbers 1 through K). Each RF/IF receiver (e.g., 110) is 
coupled to antenna or antennas 114 in the conventional 
manner known in the art, with one RF/IF receiver 110 
allocated for each antenna 114. Moreover, the antennas are 
arranged per convention to receive components of the 
respective user waveforms along different lagged signal 
paths discussed above. Though only three antennas 114 and 
three receivers 110 are shown, the methods and systems 
taught herein may be used with any number of such devices, 
regardless of whether configured as a base station, a mobile 
unit or otherwise. Moreover, as noted above, they may be 
applied in processing other CDMA and wireless communi- 
cations signals. 

[0129] Each RF/IF receiver 110 routes digital data to each 
modem 112. Because there are multiple antennas, here, Q of 
them, there are typically Q separate channel signals com- 
municated to each modem card 112. 

[0130] Generally, each user generating a WCDMA signal 
(or other subject wireless communication signal) received 
and processed by the base station is assigned a unique 
short-code sequence for purposes of differentiating
between the multiple user waveforms received at the base
station, and each user is assigned a unique rake modem 112
for purposes of demodulating the user's received signal.
Each modem 112 may be independent, or may share
resources from a pool. The rake modems 112 process the
received signal components along fingers, with each
receiver discerning the signals associated with that receiver's
respective user codes. The received signal components
are denoted here as r_kq[t], denoting the channel signal (or
waveform) from the kth user from the qth antenna, or r_k[t],
denoting all channel signals (or waveforms) originating
from the kth user, in which case r_k[t] is understood to be a
column vector with one element for each of the Q antennas.
The modems 112 process the received signals r_k[t] to
generate detection statistics y_k^(0)[m] for the kth user for the
mth symbol period. To this end, the modems 112 can, for
example, combine the components r_k[t] by power, amplitude
or otherwise, in the conventional manner to generate
the respective detection statistics y_k^(0)[m]. In the course of
such processing, each modem 112 determines the amplitude
(denoted herein as a) of and time lag (denoted herein as τ)
between the multiple components of the respective user
channel. The modems 112 can be constructed and operated
in the conventional manner known in the art, optionally, as
modified in accord with the teachings of some of the
embodiments below.

[0131] The modems 112 route their respective user detection
statistics y_k^(0)[m], as well as the amplitudes and time
lags, to multiple user detection (MUD) logic 118 constructed
and operated as described in the sections that
follow. The MUD logic 118 processes the received signals
from each modem 112 to generate a refined output, y_k^(1)[m],
or more generally, y_k^(n)[m], where n is an index reflecting
the number of times the detection statistics are iteratively or
regeneratively processed by the logic 118. Thus, whereas the
detection statistic produced by the modems is denoted as
y_k^(0)[m], indicating that there has been no refinement, those
generated by processing the y_k^(0)[m] detection statistics with
logic 118 are denoted y_k^(1)[m], those generated by processing
the y_k^(1)[m] detection statistics with logic 118 are
denoted y_k^(2)[m], and so forth. Further waveforms used and
generated by logic 118 are similarly denoted, e.g., r^(n)[t].

[0132] Though discussed below are embodiments in 
which the logic 118 is utilized only once, i.e., to generate 
y_k^(1)[m] from y_k^(0)[m], other embodiments may employ
that logic 118 multiple times to generate still more refined 
detection statistics, e.g., for wireless communications appli- 
cations requiring lower bit error rates (BER). For example, 
in some implementations, a single logic stage 118 is used for 
voice applications, whereas two or more logic stages are 
used for data applications. Where multiple stages are 
employed, each may be carried out using the same hardware 
device (e.g., processor, co-processor or field programmable 
gate array) or with a successive series of such devices. 

[0133] The refined user detection statistics, e.g., y_k^(1)[m]
or more generally y_k^(n)[m], are communicated by the MUD
process 118 to a symbol process 120. This determines the
digital information contained within the detection statistics,
and processes (or otherwise directs) that information according
to the type of user class to which the user belongs, e.g.,
voice or data user, all in the conventional manner.

[0134] Though the discussion herein focuses on use of 
MUD logic 118 in a wireless base station, those skilled in the 
art will appreciate that the teachings hereof are equally 
applicable to MUD detection in any other CDMA signal 
processing environment such as, by way of non-limiting 
example, cellular phones and modems. For convenience,
such cellular base stations and other environments are referred to
herein as "base stations."

Multiple User Detection Processing Board

[0135] FIG. 2 depicts a multiple user detection (MUD) 
processing card according to the invention. The illustrated 
processing card 118 includes a host processor 202, an 
interface block 204, parallel processors 208, a front panel
device 210, and a multi-channel cross-over device 206
(hereinafter "Crossbar"). Although these components are
shown as separate entities, one skilled in the art can appreciate
that different configurations are possible within the
spirit of the invention. For example, the host processor 202
and the interface block 204 can be integrated into a single
assembly, or multiple assemblies.

[0136] The processing card 118 processes waveform and 
waveform components received by a base station, e.g., from 
a modem card 112 or receiver 110 contained within the base 
station, or otherwise coupled with the base station. The 
waveform typically includes CDMA waveforms; however,
the processing card 118 can also be configured for other
protocols, such as TDMA and other multiple user communication
techniques. The processing card 118 performs multiple
user detection (MUD) on the waveform data, and
generates a user signal corresponding to each user, which
includes less interference than the received signals.

[0137] The illustrated processing card 118 is a single 
board assembly and is manufactured to couple (e.g., elec- 
trically and physically mate) with a conventional base 
station (e.g., a modem card 112, receiver 110 or other 
component). The board assembly illustrated conforms to a ¾
form factor modem payload card of the type available in the 
marketplace. The processor card 118 is designed for retro- 
fitting into existing base stations or for design into new 
station equipment. In other embodiments, the processing 
card can be either single or multiple assemblies. 

[0138] The host processor 202 routes data from the inter- 
face block 204 to and among the parallel processors 208, as 
well as performs fault monitoring and automated resets, data 
transfer, and processor loading of the parallel processors 
208. The host processor 202 also processes output received 
from the parallel processors 208, and communicates the 
processed output to the interface block 204 for subsequent 
return to the base station. 

[0139] The parallel processors 208 process waveforms
and waveform components routed from the host processor
202. Typically, the parallel processors 208 process the
waveform components, and communicate the processed data
back to the host processor 202 for further processing and
subsequent transmission to the base station; however, the
intermediate processed waveforms can be communicated to
other parallel processors or directly to the base station.

[0140] The crossbar 206 is a communication switch which 
routes messages between multiple devices. It allows multiple
data ports to be connected with other data
ports. In the illustrated embodiment, the crossbar 206 provides
eight ports, where a port can be "connected" to any
other port (or to multiple ports) to provide communication 
between those two (or indeed, multiple) ports. Here, the 
crossbar 206 is a RACEway™ switch of the type commer- 
cially available from the assignee hereof. In other embodi- 
ments, other switching elements, whether utilizing the 
RACEway™ protocol or otherwise, may be used, e.g., PCI, 
I2C and so on. Indeed, in some embodiments, the components
communicate along a common bus and/or are distributed
over a network.

[0141] A front panel 210 is used to monitor the processor 
card and can be used to apply software patches, as well as 
perform other maintenance operations. Additionally, the 
front panel 210 can be used to monitor fault status and
interface connections through a series of LED indicators, or
other indicators. The illustrated front panel interfaces with the
board via the RACEway™ switch and protocol, though 
other interface techniques may be used as well. 

[0142] FIG. 3 depicts further details of the processor card 
of FIG. 2. The illustrated processor card includes a host 
processor 202 in communication with an interface block 205 
and a set of parallel processors 208 (hereinafter "compute 
elements") as described above, as well as a crossbar 206 and 
a front panel 210. Further, a power status/control device 240 
is assembled on the processor card 118. However, in other 
embodiments, the power status/control device 240 can be 
within the base station or elsewhere. 

[0143] The host processor 202 includes a host controller 
203 with an integrated processor containing a peripheral 
logic block and a 32-bit processor core. The host controller 
203 is coupled with various memory devices 205, a real time 
clock 206, and a protocol translator 208. In the illustrated 
embodiment, the host controller 203 can be a commercially
available Motorola PowerPC 8240, but it will be appreciated
by one skilled in the art that other integrated processors
(or even non-integrated processors) can be used which
satisfy the requirements herein. 

[0144] The host controller 203 controls data movement 
within the processor card 118 and between the processor 
card and the base station. It controls the crossbar device 206 
by assigning the connection between connection ports. Fur- 
ther, the host controller 203 applies functionality to the 
output generated by the parallel processors 208. The host 
controller 203 includes a monitor/watchdog sub-system 
which monitors the performance of the various components
within the processor card, and can issue resets to the 
components. In some embodiments, these functions can be 
provided (or otherwise assisted) by application specific 
integrated circuits or field programmable gate arrays. 

[0145] The host controller 203 integrates a PCI bus 211a, 
211b for data movement with the memory devices 205 and
the interface block 205, as well as other components. The 
PCI bus 211a, 211b is capable of 32-bit or 64-bit data 
transfers operating at 33 MHz, or alternatively 66 MHz 
speeds, and supports access to PCI memory address spaces 
using either (or both) little and/or big endian protocols. 

[0146] Memory devices used by the host controller 203 
include HA Registers 212, synchronous dynamic random 
access memory (SDRAM) 214, Flash memory 216, and 
Non-Volatile RAM (NVRAM) 218. As will be evident below,
each type of memory is used for differing purposes. 

[0147] The HA registers 212 store operating status (e.g., 
faults) for the parallel processors 208, the power status/ 
control device 240, and other components. A fault monitoring
sub-system ("watchdog") writes both software and hardware
status into the HA registers 212, which the host
controller 203 monitors to determine the
operational status of the components. The HA registers 212
are mapped into banked memory locations, and are thereby 
addressable as direct access registers. In some embodiments, 
the HA registers 212 can be integrated with the host con- 
troller 203 and still perform the same function. 

[0148] The SDRAM 214 stores temporary application and
data. In the illustrated embodiment, there are 64 Kbytes of
SDRAM 214 available to support transient data, e.g., intermediary
results from processing and temporary data values.
The SDRAM 214 is designed to be directly accessed by the
host controller 203 allowing for fast DMA transfers.

[0149] The flash memory 216 includes two Intel 
StrataFlash devices, although equivalent memory devices 
are commercially available. It stores component
performance data, and intermediate data which can be
used to continue operation after resets are issued. The flash 
memory is blocked at 8 Kbyte boundaries, but in other 
embodiments, the block size can vary depending on the 
addressing capabilities of the host controller 203 and method 
of communication with the memory devices. Further, 
because flash memory requires no power source to retain 
programmed memory data, its data can be used for diag- 
nostic purposes even in the event of power-failures. 

[0150] NVRAM is, to an extent, reserved for fault record
data and configuration information. Data stored within the
NVRAM 218, together with the flash memory 216, is sufficient
to reproduce the data within the SDRAM 214 upon
system (or board level, or even component level) reset. If a
component is reset during operation, the host controller 203
can use the data stored in the NVRAM to continue operation
without the necessity of receiving additional information
from the base station. The NVRAM 218 is coupled to the
host controller 203 via a buffer which converts the voltage
of the PCI bus 211a from 3.3 V to 5 V, as required by the
NVRAM 218; however, this conversion is not necessary in
other embodiments with different memory configurations.

[0151] The interface block 205 includes a PCI bridge 222 
in communication with an Ethernet interface 224 and a 
modem connection 226. The PCI bridge 222 translates data 
received from the PCI bus 211b into a protocol recognized 
by the base station modem card 112. Here, the modem 
connection 226 operates with a 32-bit interface operating at 
66 MHz, however, in other embodiments the modem can 
operate with different characteristics. The Ethernet connection
224 can operate at either 10 Mbits/sec or 100 Mbits/sec,
and is therefore suited for most Ethernet devices. Those
skilled in the art can appreciate that these interface devices 
can be interchanged with other interface devices (e.g., LAN, 
WAN, SCSI and the like). 

[0152] The real-time clock 206 supplies timing for the 
host controller 203 and the parallel processors 208, and thus, 
synchronizes data movement within the processing card. It 
is coupled with the host controller 203 via an integrated I2C 
bus (as established by Philips Corporation, although in
other embodiments the clock can be connected via other 
electrical coupling). The real-time clock 206 is implemented 
as a CMOS device for low power consumption. The clock 
generates signals which control address and data transfers 
within the host controller 203 and the multiple processors 
208. 

[0153] A protocol converter 208 (hereinafter "PXB") con- 
verts PCI protocol used by the host controller 203 to 
RACEway™ protocol used by the parallel processors 208 
and front panel 210. The PXB 208 contains a field program- 
mable gate array ("FPGA") and EEPROM which can be 
programmed from the PCI bus 211b. In some embodiments, 
the PXB 208 is programmed during manufacture of the 
processing card 118 to contain configuration information for 
the related protocols and/or components with which it 
communicates. In other embodiments, the PXB 208 can use
other protocols as necessary to communicate with the mul- 
tiple processors 208. Of course, if the host controller 203 and 
the multiple processors 208 use the same protocol, there is 
no protocol conversion necessary and therefore the PXB is 
not required. 

[0154] The multiple-port communication device 206 
(hereinafter "crossbar") provides communication between 
all processing and input/output elements on the processing 
card 118. In the illustrated embodiment, the crossbar 206 is 
an EEPROM device which can be read and programmed by 
a RACEway™ compatible component (e.g., the front panel 
210 or parallel processors 208), but it is typically pro- 
grammed initially during manufacture. An embedded ASIC 
device controls the EEPROM programming, and hence, the 
function of the crossbar 206. 

[0155] The crossbar 206 in the illustrated embodiment provides up to
three simultaneous 266 Mbytes/sec throughput data paths
between elements for a total throughput of 798 Mbytes/sec;
however, in other embodiments the actual throughput varies
according to processing speed. Here, two crossbar ports
(e.g., ports 0 and 1) connect to a bridge FPGA which further
connects to the front panel 210. Each of the multiple processors
uses a crossbar port (e.g., ports 2, 3, 5, and 6), and the
interface block 224 and host controller 203 share one
crossbar port (e.g., port 4) via the PXB 206. The number of
ports on the crossbar 206 depends on the number of parallel 
processors and other components that are in communication. 

[0156] The multiple processors 208 in the illustrated 
embodiment include four compute elements 220a-220d 
(hereinafter, reference to element 220 refers to a general 
compute element, also referred to herein as a "processing
element" or "CE"). Each processing element 220 applies
functionality on data, and generates processed data in the
form of a matrix, vector, or waveform. The processing
elements 220 can also generate scalar intermediate values.
Generated data is passed to the host controller 203, or to
other processing elements 220 for further processing. Further,
individual processing elements can be partitioned to
operate in series (e.g., as a pipeline) or in parallel with the 
other processing elements. 

[0157] A processing element 220 includes a processor 228
coupled with a cache 230, a Joint Test Action Group
(hereinafter "JTAG") interface 232 with an integrated programming
port, and an application specific integrated circuit
234 (hereinafter "ASIC"). Further, the ASIC 234 is coupled
with a 128 Mbyte SDRAM device 236 and HA registers
238. The HA registers are coupled with 8 Kbytes of
NVRAM 244. In the illustrated embodiment the compute
elements 220 are on the same assembly as the host controller 
203. In other embodiments, the compute nodes 220 can be 
separate from the host controller 203 depending on the 
physical and electrical characteristics of the target base 
station. 

[0158] The compute node processors 228 illustrated are 
Motorola PowerPC 7400, however in other embodiments 
the processor can be other processor devices. Each processor 
228 uses the ASIC 234 to interface with a RACEway™ bus 
246. The ASIC 234 provides certain features of a compute 
node 220, e.g., a DMA engine, mail box interrupts, timers, 
page mapping registers, SDRAM interface and the like. In 
the illustrated embodiment the ASIC is programmed during
manufacture; however, it can also be programmed in the
field, or even at system reset in other embodiments.

[0159] The cache 230 for each compute node 220 stores
matrices that are slow-changing or otherwise static in rela- 
tion to other matrices. The cache 230 is pipelined, single- 
cycle deselect, synchronous burst static random access 
memory, although in other embodiments high-speed RAM 
or similar devices can be used. The cache 230 can be 
implemented using various devices, e.g., multiple 64 Kbyte 
devices, multiple 256 Kbyte devices, and so on. 

Architecture Pairing of Processing Nodes with 
NVRAM and Watchdog; Majority Voter 

[0160] The HA registers 238 store fault status for the 
software and/or hardware of the compute element 220. As 
such, they respond to the watchdog fault monitor which also
monitors the host controller 203 and other components. The
NVRAM 244, much like the NVRAM coupled with the
host controller 203, stores data from which the current state
of the compute element 220 can be recreated should a fault 
or reset occur. The SDRAM 236 is used for intermediate and 
temporary data storage, and is directly addressable from 
both the ASIC 234 and the processor 228. These memory 
devices can be other devices in other embodiments, depend- 
ing on speed requirements, throughput and computational 
complexity of the multiple user detection algorithms. 

[0161] NVRAM is also used to store computational vari- 
ables and data such that upon reset of the processing element 
or host controller, execution can be restarted without the
need to refresh the data. Further, the contents of NVRAM
can be used to diagnose fault states and/or conditions, thus
aiding in determining the cause of a fault state.

[0162] As noted above, a "watchdog" monitors performance
of the processing card 118. In the illustrated embodiment,
there are five independent "watchdog" monitors on
the processing card 118 (e.g., one for the host controller 203
and one for each compute node 220a-220d, and so on).
The watchdog also monitors performance of the PCI bus as
well as the RACEway™ bus connected with each processing
element and the data switch. The RACEway™ bus includes
out-of-band fault management coupled with the watchdogs.

[0163] Each component strobes its watchdog periodically,
at least every 20 msec but not faster than every 500 microseconds
(these timing parameters vary among embodiments depending
on overall throughput of the components and clock
speed). The watchdog is initially strobed approximately two
seconds after the initialization of a board level reset, which
allows for start-up sequencing of the components without
cycling erroneous resets. Strobing the watchdog for the
processing nodes is accomplished by writing a zero or a one
sequence to a discrete word (e.g., within the HA Register
212) originating within each compute element 220a-220d,
the host controller 203, and other components. The watchdog
for the host controller 203 is serviced by writing to the
memory mapped discrete location FFF_D027 which is contained
within the HA Registers 212.
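
The strobe timing window described above can be sketched as a check over strobe timestamps. This is only an illustration: the constant and function names are hypothetical, and treating two seconds as a hard deadline for the first strobe after reset is an assumed reading of the "approximately two seconds" start-up period.

```python
MIN_INTERVAL_S = 500e-6   # strobes closer together than this are too fast
MAX_INTERVAL_S = 20e-3    # a gap longer than this means a strobe was missed
STARTUP_WINDOW_S = 2.0    # assumed deadline for the first strobe after reset

def check_strobes(strobe_times, reset_time=0.0):
    """Return (time, reason) pairs for every timing violation found.

    strobe_times: ascending timestamps, in seconds, of watchdog strobes.
    """
    if not strobe_times:
        return [(reset_time + STARTUP_WINDOW_S, "no initial strobe")]
    faults = []
    if strobe_times[0] - reset_time > STARTUP_WINDOW_S:
        faults.append((strobe_times[0], "initial strobe late"))
    for previous, current in zip(strobe_times, strobe_times[1:]):
        gap = current - previous
        if gap < MIN_INTERVAL_S:
            faults.append((current, "strobe too fast"))
        elif gap > MAX_INTERVAL_S:
            faults.append((current, "strobe missed"))
    return faults

# Usage: a healthy component strobing every 10 msec, and a faulty one.
healthy = check_strobes([1.9, 1.91, 1.92])
faulty = check_strobes([1.9, 1.9001, 1.95])
```

Bounding the interval on both sides catches a component that has stalled as well as one that is stuck in a tight loop writing to its watchdog.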

[0164] The watchdog uses five 8-bit status registers within
the HA registers 212, and additional registers (e.g., HA 
registers 238) within each compute node 220. One register 
represents the host controller 203 status, and the other four 
represent each compute node 220a-220d status. Each reg- 
ister has a format as follows: 
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Bit Name 


Dcscxiption 


0 


CHECKSTOP_OUT 


Checkstop state of CPU 






(0 » CPU in checkstop) 


1 


WDM_FAULT 


WDM failed (0 - WDM failed, 






set high aftei reset and valid service) 


2 


SOFIWARE_FAULT 


(Set to 0 when a software exception 






was detected) (R/W local) 


3 


RESETREQ_IN 


Wrap status of the local CPU's reset request 


4 


WDMJNTT 


WDM failed in initial 2 second window 






(0 = WDM failed) 


5 


Software definable 0 


Software definable 0 


6 


Software definable 1 


Software definable 1 


7 


Unused 


Unused 



[0165] The five registers reflect status information for all
processors within the processing board 118, and allow the
host controller 203 to obtain the status of each without the need
for polling each processor individually (which would degrade
performance and throughput). Additionally, the host controller
203 and each compute node processor 228 have a fault
control register which contains fault data according to the
following format:



Bit  Name                   Description

0    RESETREQ_OUT_0         Request a reset event (0 => forces reset)
1    CHKSTOPOUT_0           Request that node 0 enter checkstop state (0 => request checkstop)
2    CHKSTOPOUT_1           Request that node 1 enter checkstop state (0 => request checkstop)
3    CHKSTOPOUT_2           Request that node 2 enter checkstop state (0 => request checkstop)
4    CHKSTOPOUT_3           Request that node 3 enter checkstop state (0 => request checkstop)
5    CHKSTOPOUT_8240        Request that the host controller enter checkstop state (0 => request checkstop)
6    Software definable 0   Software definable 0
7    Software definable 1   Software definable 1
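The two register layouts above can be exercised with a short sketch. It assumes that bit 0 is the least-significant bit of the 8-bit value and that unused or software-definable bits idle high; neither point is stated in the text, and the function names are hypothetical. The name used for bit 4 is the initial-window watchdog flag as read from the status table.

```python
# Illustrative only: bit 0 is treated as the least-significant bit (an assumption).
STATUS_FLAGS = {
    0: "CHECKSTOP_OUT",
    1: "WDM_FAULT",
    2: "SOFTWARE_FAULT",
    4: "WDM_INIT",
}


def decode_status(reg):
    """Return the names of the active-low fault flags that are asserted (bit == 0)."""
    return [name for bit, name in STATUS_FLAGS.items() if not (reg >> bit) & 1]


def fault_control_value(reset=False, checkstop_nodes=(), checkstop_host=False):
    """Build an 8-bit fault control value; every request is signalled by a 0 bit."""
    value = 0xFF  # all requests inactive (assumed idle state)
    if reset:
        value &= ~(1 << 0)           # RESETREQ_OUT_0
    for node in checkstop_nodes:
        if node not in (0, 1, 2, 3):
            raise ValueError("node must be 0..3")
        value &= ~(1 << (node + 1))  # CHKSTOPOUT_0 .. CHKSTOPOUT_3 occupy bits 1..4
    if checkstop_host:
        value &= ~(1 << 5)           # CHKSTOPOUT_8240
    return value & 0xFF
```

Because every flag is active-low, a register reading of all ones means that no fault is reported and no request is pending.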



[0166] A single write of any value will strobe the watch- 
dog. Upon events such as power-up, the watchdogs are 
initialized to a fault state. Once a valid strobe is issued, the 
watchdog executes and, if all elements are properly operat- 
ing, writes a no-fault state to the HA register 212. This 
occurs within the initial two-second period after board level 
reset. If a processor node fails to service the watchdog 
within the valid time frame, the watchdog records a fault 
state. A watchdog of a compute node 220 in fault triggers an 
interrupt to the host controller 203. If a fault is within the 
host controller 203, then the watchdog triggers a reset to the 
board. The watchdog then remains in a latched failed state 
until a CPU reset occurs followed by a valid service 
sequence. 
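The watchdog behavior described in this and the preceding paragraphs can be modeled, under stated assumptions, as a small state machine. The sketch assumes that a strobe arriving faster than the 500 microsecond minimum is treated as a failure, and that time is supplied by the caller in microseconds; the class and method names are hypothetical and do not come from the illustrated embodiment.

```python
class Watchdog:
    """Toy model of one watchdog monitor (all names are illustrative)."""

    MIN_INTERVAL_US = 500        # strobes closer together than this are invalid
    MAX_INTERVAL_US = 20_000     # a strobe must arrive at least this often
    INIT_WINDOW_US = 2_000_000   # first strobe expected within about 2 s of reset

    def __init__(self):
        self.reset(0)

    def reset(self, now_us):
        # Watchdogs are initialized to a fault state until a valid strobe arrives.
        self.fault = True
        self.latched = False
        self.reset_time = now_us
        self.last_strobe = None

    def _fail(self):
        self.fault = True
        self.latched = True      # stays failed until reset() and a valid strobe

    def strobe(self, now_us):
        if self.latched:
            return
        if self.last_strobe is None:
            ok = now_us - self.reset_time <= self.INIT_WINDOW_US
        else:
            gap = now_us - self.last_strobe
            ok = self.MIN_INTERVAL_US <= gap <= self.MAX_INTERVAL_US
        self.last_strobe = now_us
        if ok:
            self.fault = False
        else:
            self._fail()

    def tick(self, now_us):
        """Poll for a missed strobe; call this from a periodic timer."""
        if self.latched:
            return
        if self.last_strobe is None:
            deadline = self.reset_time + self.INIT_WINDOW_US
        else:
            deadline = self.last_strobe + self.MAX_INTERVAL_US
        if now_us > deadline:
            self._fail()
```

A caller would invoke `strobe()` on each service write and `tick()` from a timer; once a failure is latched, only `reset()` followed by a valid strobe returns the model to the no-fault state.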

[0167] Each processor node ASIC 234 accesses a DIAG3 
signal that is wired to an HA register, and is used to strobe 
the compute element's hardware watchdog monitor. A 
DIAG2 signal is wired to the host processor's embedded
programmable interrupt controller (EPIC) and is used by a 
compute element to generate a general purpose interrupt to 
the host controller 203. 

[0168] A majority voter (hereinafter "voter") is a dual
software sub-system state machine that identifies faults
within each of the processors (e.g., the host controller 203
and each compute node 220a-220d) and also of the processor
board 118 itself. The local voter can reset individual
processors (e.g., a compute node 220) by asserting a
CHECKSTOP_IN to that processor. The board level voter
can force a reset of the board by asserting a master reset,
wherein all processors are reset. Both voters follow a rule set
that the output will follow the majority of non-checkstopped
processors. If there are more processors in a fault condition
than in a non-fault condition, the voter will force a board reset.
Of course, other embodiments may use other rules, or can
use a single sub-system to accomplish the same purpose.
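The voting rule can be illustrated with a minimal sketch. It assumes that each processor is represented by a pair of booleans (checkstopped, wants reset) and that a tie does not force a reset; both choices are assumptions for illustration, and the function name is hypothetical.

```python
def board_reset_vote(processors):
    """Return True when a majority of non-checkstopped processors vote for reset.

    processors: iterable of (checkstopped, wants_reset) boolean pairs.
    """
    # Checkstopped processors are excluded from the vote entirely.
    votes = [wants_reset for checkstopped, wants_reset in processors
             if not checkstopped]
    yes = sum(votes)
    no = len(votes) - yes
    return yes > no  # a tie is assumed not to force a reset
```

With one host controller and four compute nodes, three reset votes out of five healthy processors force a board reset, while two do not.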

[0169] A majority voter is illustrated in FIG. 4. Board
level resets are initiated from a variety of sources. One such
source is a voltage supervisor (e.g., the power status/control
device 240), which can generate a 200 ms reset if the voltage
(e.g., VCC) rises above a predetermined threshold, such as
4.38 volts (this is also used in the illustrated embodiment in
a pushbutton reset switch 406; however, the push button can
also be a separate signal). The board level voter will continue
to drive a RESET_0 408 until both the voltage
supervisor 404 and the PCI_RESET_0 410 are de-asserted.
Either reset will generate the signal RESET_0 412, which
resets the card into a power-on state. RESET_0 412 also
generates HRESET_0 414 and TRST 416 signals to each
processor. Further, HRESET_0 and TRST can be generated
by the JTAG ports using a JTAG_HRESET_0 418 and
JTAG_TRST 420, respectively. The host controller 203 can
generate a reset request, a soft reset (C_SRESET_0 422) to
each processor, a check-stop request, and an ASIC reset
(CE_RESET_0 424) to each of the four compute elements'
ASICs. A discrete word from the 5V-powered reset PLD will
generate the signal NPORESET_1 (not a power on reset).
This signal is fed into the host processor discrete input word.
The host processor will read this signal as logic low only if
it is coming out of reset due to either a power condition or
an external reset from off board. Each compute element, as
well as the host processor, can request a board level reset.
These requests are majority voted, and the result
RESETVOTE_0 will generate a board level reset.

[0170] Each compute node processor 228 has a hard reset
signal driven by three sources gated together: a HRESET_0
pin 426 on each ASIC, a HRESET_0 418 from the JTAG
connector 232, and a HRESET_0 412 from the majority
voter. The HRESET_0 pin 426 from the ASIC is set by the
"node run" bit field (bit 0) of the ASIC Miscon_A register.
Setting HRESET_0 426 low causes the node processor to
be held in reset. Because HRESET_0 426 is low immediately
after system reset or power-up, the node processor is held in
reset until the HRESET_0 line is pulled high by setting the node
run bit to 1. The JTAG HRESET_0 418 is controlled by
software when a JTAG debugger module is connected to the
card. The HRESET_0 412 from the majority voter is
generated by a majority vote from all healthy nodes to reset.

[0171] When a processor reset is asserted, the compute 
processor 228 is put into reset state. The compute processor 
228 remains in a reset state until the RUN bit 0 of the 
Miscon_A register is set to 1 and the host processor has 
released the reset signals in the discrete output word. The 
RUN bit is set to 1 after the boot code has been loaded into 
the SDRAM starting at location 0x0000_0100. The ASIC
maps the reset vector 0xFFF0_0100 generated by the
MPC7400 to address 0x0000_0100.

[0172] Turning now to discuss memory devices 205
coupled with the host controller 203, the memory devices
are addressable by the host controller 203 as follows. The
host controller 203 addresses the memory devices (e.g., the
HA registers 212, SDRAM 214, Flash 216 and NVRAM
218) using two address mapping configurations designated
as address map A and address map B, although other
configurations are possible. Address map A conforms to the
PowerPC reference platform (PReP) specification (however,
if other host controllers are used, map A conforms with a
native reference platform to that host controller). Address
map B conforms to the host controller 203 common hardware
reference platform (CHRP).

[0173] Support of map A is provided for backward com- 
patibility, and further supports any retrofitting of existing 
base station configurations. The address space of map B is 
divided into four areas: system memory, PCI memory, PCI 
Input/Output (I/O), and system ROM space. When config- 
ured for map B, the host controller translates addresses 
across the internal peripheral logic bus and the external PCI 
bus as follows: 



Processor Core Address Range                       PCI Address Range      Definition
Hex                    Decimal

0000_0000  0009_FFFF   0        640K-1             NO PCI CYCLE           System memory
000A_0000  000F_FFFF   640K     1M-1               000A_0000-000F_FFFF    Compatibility hole
0010_0000  3FFF_FFFF   1M       1G-1               NO PCI CYCLE           System memory
4000_0000  7FFF_FFFF   1G       2G-1               NO PCI CYCLE           Reserved
8000_0000  FCFF_FFFF   2G       4G-48M-1           8000_0000-FCFF_FFFF    PCI memory
FD00_0000  FDFF_FFFF   4G-48M   4G-32M-1           0000_0000-00FF_FFFF    PCI/ISA memory
FE00_0000  FE7F_FFFF   4G-32M   4G-24M-1           0000_0000-007F_FFFF    PCI/ISA I/O
FE80_0000  FEBF_FFFF   4G-24M   4G-20M-1           0080_0000-00BF_FFFF    PCI I/O
FEC0_0000  FEDF_FFFF   4G-20M   4G-18M-1           CONFIG_ADDR            PCI configuration address
FEE0_0000  FEEF_FFFF   4G-18M   4G-17M-1           CONFIG_DATA            PCI configuration data
FEF0_0000  FEFF_FFFF   4G-17M   4G-16M-1           FEF0_0000-FEFF_FFFF    PCI interrupt acknowledge
FF00_0000  FF7F_FFFF   4G-16M   4G-8M-1            FF00_0000-FF7F_FFFF    32/64-bit Flash/ROM
FF80_0000  FFFF_FFFF   4G-8M    4G-1               FF80_0000-FFFF_FFFF    8/32/64-bit Flash/ROM

[0174] In the illustrated embodiment, hex address FF00_0000
through FF7F_FFFF is not used, and hence, that bank
of Flash ROM is not used.

[0175] The address range FF80_0000 through FFFF_FFFF is
used, as the Flash ROM is configured in 8-bit mode and is
addressed as follows:

Bank Select    Processor Core Address Range    Definition

11111          FFE0_0000  FFEF_FFFF            Accesses Bank 0
11110-00001    FFE0_0000  FFEF_FFFF            Application code (30 pages)
00000          FFE0_0000  FFEF_FFFF            Application/boot code
xxxx           FFF0_0000  FFFF_CFFF            Application/boot code
               FFFF_D000  FFFF_D000            Discrete input word 0
               FFFF_D001  FFFF_D001            Discrete input word 1
               FFFF_D002  FFFF_D002            Discrete output word 0
               FFFF_D003  FFFF_D003            Discrete output word 1
               FFFF_D004  FFFF_D004            Discrete output word 2
               FFFF_D010  FFFF_D010            IC (Pending interrupt)
               FFFF_D011  FFFF_D011            IC (Interrupt mask low)
               FFFF_D012  FFFF_D012            IC (Interrupt clear low)
               FFFF_D013  FFFF_D013            IC (Unmasked, pending low)
               FFFF_D014  FFFF_D014            IC (Interrupt input low)
               FFFF_D015  FFFF_D015            Unused (read FF)
               FFFF_D016  FFFF_D016            Unused (read FF)
               FFFF_D017  FFFF_D017            Unused (read FF)
               FFFF_D018  FFFF_D018            Unused (read FF)
               FFFF_D019  FFFF_D019            Unused (read FF)
               FFFF_D020  FFFF_D020            HA (Local HA register)
               FFFF_D021  FFFF_D021            HA (Node 0 HA register)
               FFFF_D022  FFFF_D022            HA (Node 1 HA register)
               FFFF_D023  FFFF_D023            HA (Node 2 HA register)
               FFFF_D024  FFFF_D024            HA (Node 3 HA register)
               FFFF_D025  FFFF_D025            HA (8240 HA register)
               FFFF_D026  FFFF_D026            HA (Software Fail)
               FFFF_D027  FFFF_D027            HA (Watchdog Strobe)
               FFFF_D028  FFFF_DFFF            4068 Bytes Flash
               FFFF_E000  FFFF_FFFF            8 K NVRAM
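The map B translation between processor core addresses and PCI addresses can be sketched as a table lookup. The table contents below are taken from the map B address ranges; the data layout and the function name `translate` are hypothetical, and the sketch says nothing about how the host controller hardware actually performs the translation.

```python
# Each entry: (first address, last address, PCI target, definition).
# PCI target is None (no PCI cycle), an integer PCI base address, or the
# name of a configuration register.
MAP_B = [
    (0x0000_0000, 0x0009_FFFF, None,          "System memory"),
    (0x000A_0000, 0x000F_FFFF, 0x000A_0000,   "Compatibility hole"),
    (0x0010_0000, 0x3FFF_FFFF, None,          "System memory"),
    (0x4000_0000, 0x7FFF_FFFF, None,          "Reserved"),
    (0x8000_0000, 0xFCFF_FFFF, 0x8000_0000,   "PCI memory"),
    (0xFD00_0000, 0xFDFF_FFFF, 0x0000_0000,   "PCI/ISA memory"),
    (0xFE00_0000, 0xFE7F_FFFF, 0x0000_0000,   "PCI/ISA I/O"),
    (0xFE80_0000, 0xFEBF_FFFF, 0x0080_0000,   "PCI I/O"),
    (0xFEC0_0000, 0xFEDF_FFFF, "CONFIG_ADDR", "PCI configuration address"),
    (0xFEE0_0000, 0xFEEF_FFFF, "CONFIG_DATA", "PCI configuration data"),
    (0xFEF0_0000, 0xFEFF_FFFF, 0xFEF0_0000,   "PCI interrupt acknowledge"),
    (0xFF00_0000, 0xFF7F_FFFF, 0xFF00_0000,   "32/64-bit Flash/ROM"),
    (0xFF80_0000, 0xFFFF_FFFF, 0xFF80_0000,   "8/32/64-bit Flash/ROM"),
]


def translate(addr):
    """Map a processor core address to (definition, PCI target)."""
    for first, last, target, definition in MAP_B:
        if first <= addr <= last:
            if isinstance(target, int):
                # Preserve the offset within the region.
                return definition, target + (addr - first)
            return definition, target
    raise ValueError("address outside the 32-bit map")
```

For example, a processor access to FD00_1234 falls in the PCI/ISA memory region and appears on the PCI bus at offset 1234 from address zero.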



[0176] Address FFE0_0000 through FFEF_FFFF contains
30 pages, and is used for application and boot code, as
selected by the Flash bank bits. Further, there is a 2 Mbyte
block available after reset. Data movement occurs on the
PCI 211a and/or a memory bus.

DMA Engine Supported by Host Controller and 
FPGA 

[0177] Direct memory access (DMA) is performed by the 
host controller 203, and operates independently from the 
host processor 203 core, as illustrated in FIG. 5. The host 
controller 203 has an integrated DMA engine including a 
DMA command stack 502, a DMA state engine 504, an 
address decode block 506, and three FIFO interfaces 508, 
510, 512. The DMA engine receives and sends information 
via the PXB 208 coupled with the crossbar 206. 

[0178] The command stack 502 and state machine 504
process DMA requests and transfers. The stack 502 and
state machine 504 can initiate both cycle stealing and burst
mode, along with host controller interrupts. The address
decode 506 sets the bus address, and triggers transmissions
of the data.

[0179] The host controller 203 has two DMA I/O interfaces,
each with a 64-byte queue to facilitate the gathering
and sending of data. Both the local processor and PCI
masters can initiate a DMA transfer. The DMA controller
supports memory transfers between PCI and memory,
between local and PCI memory, and between local memory
devices. Further, the host controller 203 can transfer in either
block mode or scatter mode within discontinuous memory.
A receiving channel 510 buffers data that is to be received
by the memory. A transmit channel 512 buffers data that is
sent from memory. Of course, the buffers can also send/
receive information from other devices, e.g., the compute
nodes 220, or other devices capable of DMA transfers.

[0180] The host controller 203 contains an embedded 
programmable interrupt controller (EPIC) device. The inter- 
rupt controller implements the necessary functions to pro- 
vide a flexible and general-purpose interrupt controller. 
Further, the interrupt controller can pool interrupts generated 
from the several external components (e.g., the compute 
elements), and deliver them to the processor core in a 
prioritized manner. In the illustrated embodiment, an Open- 
PIC architecture is used, although it can be appreciated by 
one skilled in the art that other such methods and techniques 
can be used. Here, the host controller 203 supports up to five 
external interrupts, four internal logic-driven interrupts, and 
four timers with interrupts. 

[0181] Data transfers can also take effect via the FPGA
program interface 508. This interface can program and/or
accept data from various FPGAs, e.g., the compute node
ASIC 234, crossbar 242, and other devices. Data transfers
within the compute node processor 228 to its ASIC 234 and
RACEway™ bus 246 are addressed as follows:



From Address   To Address     Function

0x0000 0000    0x0FFF FFFF    Local SDRAM 256 MB
0x1000 0000    0x1FFF FFFF    crossbar 256 MB map window 1
0x2000 0000    0x2FFF FFFF    crossbar 256 MB map window 2
0x3000 0000    0x3FFF FFFF    crossbar 256 MB map window 3
0x4000 0000    0x4FFF FFFF    crossbar 256 MB map window 4
0x5000 0000    0x5FFF FFFF    crossbar 256 MB map window 5
0x6000 0000    0x6FFF FFFF    crossbar 256 MB map window 6
0x7000 0000    0x7FFF FFFF    crossbar 256 MB map window 7
0x8000 0000    0x8FFF FFFF    crossbar 256 MB map window 8
0x9000 0000    0x9FFF FFFF    crossbar 256 MB map window 9
0xA000 0000    0xAFFF FFFF    crossbar 256 MB map window A
0xB000 0000    0xBFFF FFFF    crossbar 256 MB map window B
0xC000 0000    0xCFFF FFFF    crossbar 256 MB map window C
0xD000 0000    0xDFFF FFFF    crossbar 256 MB map window D
0xE000 0000    0xEFFF FFFF    crossbar 256 MB map window E
0xF000 0000    0xFBFF FBFF    Not used (CE reg replicated mapping)
0xFBFF FC00    0xFBFF FDFF    Internal CN ASIC registers
0xFBFF FE00    0xFEFF FFFF    Pre-fetch control
0xFF00 0000    0xFFFF FFFF    16 MB boot FLASH memory area
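Because each crossbar map window in the compute node address map is 256 MB wide, the window number is simply the top four address bits. A minimal sketch of a classifier over this map follows; the function name `classify` and the returned strings are illustrative assumptions.

```python
def classify(addr):
    """Name the compute node address region that contains a 32-bit address."""
    if not 0 <= addr <= 0xFFFF_FFFF:
        raise ValueError("address must fit in 32 bits")
    if addr <= 0x0FFF_FFFF:
        return "Local SDRAM"
    if addr <= 0xEFFF_FFFF:
        # Windows 1 through E: the window number is the top hex digit.
        return "crossbar map window " + format(addr >> 28, "X")
    if addr <= 0xFBFF_FBFF:
        return "Not used (CE reg replicated mapping)"
    if addr <= 0xFBFF_FDFF:
        return "Internal CN ASIC registers"
    if addr <= 0xFEFF_FFFF:
        return "Pre-fetch control"
    return "16 MB boot FLASH memory area"
```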



[0182] The SDRAM 236 can be addressable in 8, 16, 32 
or 64 bit addresses. The RACEway™ bus 246 supports 
locked read/write and locked read transactions for all data 
sizes. A 16 Mbyte boot flash area is further divided as 
follows: 



From Address   To Address     Function

0xFF00 2006    0xFF00 2006    Software Fail Register
0xFF00 2005    0xFF00 2005    MPC8240 HA Register
0xFF00 2004    0xFF00 2004    Node 3 HA Register
0xFF00 2003    0xFF00 2003    Node 2 HA Register
0xFF00 2002    0xFF00 2002    Node 1 HA Register
0xFF00 2001    0xFF00 2001    Node 0 HA Register
0xFF00 2000    0xFF00 2000    Local HA Register (status/control)
0xFF00 0000    0xFF00 1FFF    NVRAM


[0183] Slave accesses are accesses initiated by an external 
RACEway™ device directed toward the compute element 
processor 238. The ASIC 234 supports a 256 Mbyte address 
space which can be partitioned as follows: 


From Address   To Address     Function

0x0000 0000    0x0FFF FBFF    256 MB less 1 KB hole SDRAM
0xFFF_FC00     0xFFF_FFFF     PCE133 internal registers



[0184] There are 16 discrete output signals directly con- 
trollable and readable by the host controller 203. The 16 
discrete output signals are divided into two addressable 8-bit 
words. Writing to a discrete output register will cause the 
upper 8 bits of the data bus to be written to the discrete
output latch. Reading a discrete output register will drive the
8-bit discrete output onto the upper 8 bits of the host
processor data bus. The bits in the discrete output word are 
defined as follows: 

[0185] There are 16 discrete input signals accessible by 
the host controller 203. Reads from the discrete input 
address space will latch the state of the signals, and return 
the latched state of the discrete input signals to the host 
processor. The bits in the discrete input word are as follows: 



DH(0:7) Signal      Description

Output Word 2

ND0_FLASH_EN_1      Enable the CE ASIC's FLASH port when 1
ND1_FLASH_EN_1      Enable the CE ASIC's FLASH port when 1
ND2_FLASH_EN_1      Enable the CE ASIC's FLASH port when 1
ND3_FLASH_EN_1      Enable the CE ASIC's FLASH port when 1
WRAP1               Wrap to discrete input

Output Word 1

WRAP0               Wrap to discrete input
I2C_RESET_0         Reset the I2C serial bus when 0
SWLED               Software controlled LED
FLASHSEL4           Flash bank select address bit 4
FLASHSEL3           Flash bank select address bit 3
FLASHSEL2           Flash bank select address bit 2
FLASHSEL1           Flash bank select address bit 1
FLASHSEL0           Flash bank select address bit 0

Output Word 0

C_SRESET3_0         Issue a Soft Reset to CPU on Node 3 when 0
C_PRESET3_0         Reset PCE133 ASIC Node 3 when 0
C_SRESET2_0         Issue a Soft Reset to CPU on Node 2 when 0
C_PRESET2_0         Reset PCE133 ASIC Node 2 when 0
C_SRESET1_0         Issue a Soft Reset to CPU on Node 1 when 0
C_PRESET1_0         Reset PCE133 ASIC Node 1 when 0
C_SRESET0_0         Issue a Soft Reset to CPU on Node 0 when 0
C_PRESET0_0         Reset PCE133 ASIC Node 0 when 0

Input Word 1

WRAP1               Wrap from discrete output word
V3.3_FAIL_0         Latched status of power supply since last reset
V2.5_FAIL_0         Latched status of power supply since last reset
VCORE1_FAIL_0       Latched status of power supply since last reset
VCORE0_FAIL_0       Latched status of power supply since last reset
RIOR_CNF_DONE_1     RIO/RACE++ FPGA configuration complete
PXB0_CNF_DONE_1     PXB++ FPGA configuration complete

Input Word 0

WRAP0               Wrap from discrete output word
WDMSTATUS           MPC8240's watchdog monitor status (0 = failed)
NPORESET_1          Not a power on reset when high



[0186] The host controller 203 interfaces with an 8-input
interrupt controller external to the processor itself (although
in other embodiments it can be contained within the processor).
The interrupt inputs are wired, through the controller,
to interrupt zero of the host processor external interrupt
inputs. The remaining four host processor interrupt inputs
are unused.

[0187] The Interrupt Controller comprises the following
five 8-bit registers:

Register                     Description

Pending Register             A low bit indicates a falling edge was detected on that interrupt (read only);
Clear Register               Setting a bit low will clear the corresponding latched interrupt (write only);
Mask Register                Setting a bit low will mask the pending interrupt from generating a processor interrupt;
Unmasked Pending Register    A low bit indicates a pending interrupt that is not masked out;
Interrupt State Register     Indicates the actual logic level of each interrupt input pin.

[0188] The interrupt input sources and their bit positions
within each of the five registers are as follows:

Bit  Signal          Description

0    SWFAIL_0        8240 Software Controlled Fail Discrete
1    RTC_INT_0       Real time clock event
2    NODE0_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
3    NODE1_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
4    NODE2_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
5    NODE3_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
6    PCI_INT_0       PCI interrupt
7    XB_SYS_ERR_0    crossbar internal error



[0189] A falling edge on an interrupt input will set the
appropriate bit in the pending register low. The pending
register is gated with the mask register, and any unmasked
pending interrupts will activate the interrupt output signal to
the host processor external interrupt input pin. Software
will then read the unmasked pending register to determine
which interrupt(s) caused the exception. Software can then
clear the interrupt(s) by writing a zero to the corresponding
bit in the clear register. If multiple interrupts are pending, the
software has the option of either servicing all pending
interrupts at once and then clearing the pending register, or
servicing the highest priority interrupt (software priority
scheme) and then clearing that single interrupt. If more
interrupts are still latched, the interrupt controller will generate
a second interrupt to the host processor for software to
service. This will continue until all interrupts have been
serviced.

[0190] An interrupt that is masked will show up in the 
pending register but not in the unmasked pending register 
and will not generate a processor interrupt. If the mask is 
then cleared, that pending interrupt will flow through the 
unmasked pending register and generate a processor inter- 
rupt. 
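The active-low register behavior described in the preceding paragraphs can be modeled in a few lines. The sketch assumes that bit 0 is the least-significant bit and that both registers idle at all ones; the class and method names are hypothetical and do not correspond to any register interface named in the text.

```python
class InterruptController:
    """Toy model of the active-low pending, mask, and clear registers."""

    def __init__(self):
        self.pending = 0xFF  # a 0 bit means a falling edge was latched
        self.mask = 0xFF     # a 0 bit masks the corresponding interrupt

    def falling_edge(self, bit):
        # Latch the interrupt by driving its pending bit low.
        self.pending &= ~(1 << bit) & 0xFF

    def unmasked_pending(self):
        # A bit reads low only when it is pending (0) and not masked (mask bit 1).
        return (self.pending | ~self.mask) & 0xFF

    def write_clear(self, value):
        # Each 0 bit written to the clear register releases that latched interrupt.
        self.pending |= ~value & 0xFF

    def irq_asserted(self):
        return self.unmasked_pending() != 0xFF
```

A masked interrupt stays latched in `pending`, so restoring its mask bit to 1 makes it appear in `unmasked_pending()` and asserts the processor interrupt, which matches the behavior described above.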

[0191] The multiple components within the processor 
board 118 dictate various power requirements. The proces- 
sor board 118 requires 3.3V, 2.5V, and 1.8V. In the illustrated 
embodiment, there are two processor core voltage supplies 
302, 304 each driving two 1.8V cores for two processors 
(e.g., 228). There is also a 3.3V supply 306 and a 2.5V 
supply 308 which supply voltage to the remaining compo- 
nents (e.g., crossbar 206, interface block 205 and so on). To 
provide power to the board, the three voltages (e.g., the 1.8V, 
3.3V and 2.5V) have separate switching supplies, and proper 
power sequencing. All three voltages are converted from 
5.0V. The power to the processor card 118 is provided
directly from the modem board 112 within the base station;
however, in other embodiments there is a separate or otherwise
integrated power supply. The power supply in a preferred
embodiment is rated at 12A; however, in other
embodiments the rating varies according to the specific
component requirements.

[0192] In the illustrated embodiment, for instance, the 
3.3V power supply 306 is used to provide power to the 
NVRAM 218 core, SDRAM 214, PXB 208, and crossbar 
ASIC 206 (or FPGA, if present). This power supply is rated
as a function of the devices chosen for these functions. 

[0193] A 2.5V power supply 308 is used to provide power
to the compute node ASIC 234 and can also power the PXB
208 FPGA core. The host processor bus can run at 2.5V
signaling. The host bus can operate at 2.5V signaling.

[0194] Power-on sequencing is necessary in multi-voltage
digital boards. One skilled in the art can appreciate
that power sequencing is necessary for long-term reliability.
The right power supply sequencing can be accomplished by
using inhibit signals. To provide fail-safe operation of the
device, power should be supplied so that if the core supply
fails during operation, the I/O supply is shut down as well.

[0195] In theory, the general rule is to ramp all
power supplies up and down at the same time, as illustrated
in FIG. 6. The ramp up 602 and ramp down 604 show
agreement with the power supplies 302, 304, 306, 308 over
time. One skilled in the art realizes that in reality, voltage
increases and decreases do not occur among multiple power
supplies in such a simultaneous fashion.

[0196] FIG. 7 shows the actual voltage characteristics for
the illustrated embodiment. As can be seen, ramp up 702a-702c
and ramp down 704a-704c sequences depend on
multiple factors, e.g., power supply, total board capacitance
that needs to be charged, power supply load, and so on. For
example, the ramp up for the 3.3V supply 702a occurs
before the ramp up for the 2.5V supply 702c, which occurs
before the ramp up of the 1.8V supplies 702b. Further, the
ramp down for the 3.3V supply 704a occurs before the ramp
down for the 2.5V supply 704c, which occurs before the
ramp down for the 1.8V supplies 704b.

[0197] Also, the host processor requires the core supply
to not exceed the I/O supply by more than 0.4 volts at all 
times. Also, the I/O supply must not exceed the core supply 
by more than 2 volts. Therefore, to achieve an acceptable 
power-up and power-down sequencing, e.g., to avoid dam- 
age to the components, a circuit containing diodes is used in 
conjunction with the power supplied within the base station. 
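The two voltage constraints in this paragraph can be checked against sampled ramp data with a short sketch. The sample format (time, core volts, I/O volts), the function name, and the example values in the usage note are assumptions for illustration and are not measurements from the illustrated embodiment.

```python
def sequencing_violations(samples, core_over_io_max=0.4, io_over_core_max=2.0):
    """Return the sample times at which either supply constraint is violated.

    samples: iterable of (time, core_volts, io_volts) tuples.
    """
    bad = []
    for t, core, io in samples:
        core_too_high = core - io > core_over_io_max  # core may lead I/O by at most 0.4 V
        io_too_high = io - core > io_over_core_max    # I/O may lead core by at most 2 V
        if core_too_high or io_too_high:
            bad.append(t)
    return bad
```

For instance, a sample in which the I/O rail has reached 2.5 volts while the core is still at 0 volts would be flagged, whereas 3.3 volts I/O with a 1.8 volt core would not.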

[0198] The power status/control device 240 is designed 
from a programmable logic device (PLD). The PLD is used 
to monitor the voltage status signals from the on-board
supplies. It is powered up from +5V and monitors +3.3V,
+2.5V, +1.8V_1 and +1.8V_2. This device monitors the
power_good signals from each supply. In the case of a power 
failure in one or more supplies, the PLD will issue a restart 
to all supplies and a board level reset to the processor board. 
A latched power status signal will be available from each 
supply as part of the discrete input word. The latched 
discrete can indicate any power fault condition since the last 
off-board reset condition. 

[0199] In operation, the processor board inputs raw 
antenna data from the base station modem card 112 (or other 
available location of that data), detects sources of interference
within that data, and produces a new stream of data
which has reduced interference, subsequently transmitting
that refined data back to the modem card (or other location)
for further processing within the base station.

[0200] As can be appreciated by one skilled in the art, such 
interference reduction is computationally complex; hence, 
the hardware must support throughputs sufficient for multiple
user processing. In a preferred embodiment, characteristics
of processing are a latency of less than 300 microseconds
handling data in the 110 Mbytes/sec range; however, in
other embodiments the latency and data load can vary.

[0201] In the illustrated embodiment, data from the 
modem board is supplied via the PCI bus 211b through the
PCI bridge 222. From there, the data traverses the crossbar 
206 and is loaded into the host controller memory 205. 
Output data flows in the opposite direction. Additionally, 
certain data flows between the host controller 203 and the 
compute elements 220. 

Hybrid Operating System 

[0202] The compute elements 220 operate, in some 
embodiments, under the MC/OS operating system available 
commercially from the assignee herein, although different 
configurations can run under different operating systems 
suited for such. Here, one aspect is to reduce the use of
non-POSIX system calls, which can increase portability of
the multiple user detection software among different hardware
environments and operating system environments. The
host processor is operated by the VxWorks operating sys- 
tem, as is required by MC/OS and suitable for a Motorola 
8240 PowerPC. 

[0203] FIG. 8 shows a block diagram of various compo- 
nents within the hardware/software environment. An 
MC/OS subsystem 802 is used as an operating system for the 
compute elements 220. Further, a MC/OS DX 804 provides
APIs that give access, with acceptable overhead and latency,
to the DMA engines, which in turn provide suitable bandwidth
transfers of data. DX 804 can be used to move data between the
compute elements 220 during parallel processing, and also
to move data between the compute elements 220, the host
controller 203, and the modem card 112. As described
above, each compute element 220 contains an application
806 and a watchdog 808. Further, the HA registers provide
the bootstrap 810 necessary for start-up.

[0204] The host controller 203 runs under the VxWorks 
operating system 812. The host processor 202 contains a 
watchdog 814, application data 816, and a bootstrap 818. 
Further, the host processor 202 can perform TCP/IP stack 
processing 820 for communication through the Ethernet 
interface 224. 

[0205] Input/output between the processor card 118 and 
the modem card 112 takes place by moving data between the 
Race++ Fabric and the PCI bus 211b via the PCI bridge 222.
The application 806 will use DX to initialize the PXB++ 
bridge, and to cause input/output data to move as if it were 
regular DX IPC traffic. For example, there are several 
components which can initiate data transfers and choose PCI 
addresses to be involved with the transfers. 

[0206] One approach to increasing availability on the processor
card 118 is to balance host-processing time against
application execution. For example, when the system comes
up, the application determines which processing resources
are available, determines a load mapping on the available
resources, and records certain parameters in NVRAM.
Although brief interruptions in service can occur,
the application does not need to know how to continue
execution across faults. For instance, the application can
make an assumption that the hardware configuration will not
change without the system first rebooting. If the application
is in a state which needs to be preserved across reboots, the
application checkpoints the data on a regular basis. The
system software provides an API to a portion of the
NVRAM for this purpose.

[0207] The host controller 203 is attached to an amount of 
linear flash memory 216 as discussed above. This flash 
memory 216 serves several purposes. The first purpose the 
flash memory serves is as a source of instructions to execute 
when the host controller comes out of reset. Linear flash can 
be addressed much like normal RAM. Flash memories can 
be organized to look like disk controllers; however in that 
configuration they generally require a disk driver to provide 
access to the flash memory. Although such an organization 
has several benefits such as automatic reallocation of bad 
flash cells, and write wear leveling, it is not appropriate for 
initial bootstrap. The flash memory 216 also serves as a file 
system for the host and as a place to store permanent board 
information (e.g., a serial number).

[0208] When the host controller 203 first comes out of 
reset, memory is not turned on. Since high-level languages 
such as C assume some memory is present (e.g., for a stack) 
the initial bootstrap code must be coded in assembler. This 
assembler bootstrap contains a few hundred lines of code, 
sufficient to configure the memory controller, initialize 
memory, and initialize the configuration of the host proces- 
sor internal registers. 

[0209] After the assembler bootstrap has finished execu- 
tion, control is passed to the processor HA code (which is 
also contained in boot flash memory). The purpose of the HA 
code is to attempt to configure the fabric, and load the 
compute element CPUs with HA code. Once this is com- 
plete, all the processors participate in the HA algorithm. The 
output of the algorithm is a configuration table which details 
which hardware is operational and which hardware is not. 
This is an input to the next stage of bootstrap, the multi- 
computer configuration. 

[0210] MC/OS expects the host controller system to con- 
figure the multi-computer (e.g., compute elements 220). A 
configmc program reads a textual description of the com- 
puter system configuration, and produces a series of binary 
data structures that describe the system configuration. These 
data structures are used in MC/OS to describe the routing 
and configuration of the multi-computer. 

[0211] The processor board 118 will use almost exactly the
same sequence to configure the multi-computer. The major
difference is that MC/OS expects configurations to be static,
whereas the processor board configuration changes dynamically
as faulty hardware causes various resources to be
unavailable for use.

[0212] One embodiment of the invention uses binary data 
structures produced by configmc to modify flags that indi- 
cate whether a piece of hardware is usable. A modification 
to MC/OS prevents it from using hardware marked as 
broken. Another embodiment utilizes the output of the HA 
algorithm to produce a new configuration file as input to 
configmc; the configmc execution is repeated with the new 
file, and MC/OS is configured and loaded with no knowledge 
of the broken hardware whatsoever. This embodiment 
can calculate an optimal routing table in the face of failed 
hardware, increasing the performance of the remaining 
operational components. 

[0213] After the host controller has configured the compute 
elements 220, the runmc program loads the functional 
compute elements with a copy of MC/OS. Because access to 
the processor board 118 from a TCP/IP network is required, 
the host computer system acts as a connection to the TCP/IP 
network. The VxWorks operating system contains a fully 
functional TCP/IP stack. When compute elements access 
network resources, the host computer acts as proxy, 
exchanging information with the compute element utilizing 
DX transfers, and then making the appropriate TCP/IP calls 
on behalf of the compute element. 

[0214] The host controller 203 needs a file system to store 
configuration files, executable programs, and MC/OS 
images. For this purpose, flash memory is utilized. Rather 
than have a separate flash memory from the host controller 
boot flash, the same flash is utilized for both bootstrap 
purposes and for holding file system data. The flash file 
system provides DOS file system semantics as well as write 
wear leveling. 

[0215] There are, in particular, two portions of code which 
can be remotely updated: the bootstrap code, which is 
executed by the host controller 203 when it comes out of 
reset, and the rest of the code, which resides on the flash file 
system as files. 

[0216] When code is initially downloaded to the processor 
board 118, it is written as a group of files within a directory 
in the flash file system. A single top-level index tracks which 
directory tree is used for booting the system. This index 
continues to point at the existing directory tree until a 
download of new software is successfully completed. When 
a download has been completed and verified, the top-level 
index is updated to point to the new directory tree, the boot 
flash is rewritten, and the system can be rebooted. 
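The update sequence described above can be sketched as follows. This is a minimal illustration only: the dictionary-based `store`, the `commit_download` function, and the `verify` callback are hypothetical stand-ins for the flash file system and are not taken from the described system; rewriting the boot flash and rebooting are left out.

```python
def commit_download(store, new_tree, files, verify):
    """Write a new directory tree, verify it, then switch the boot index.

    `store` is a hypothetical in-memory stand-in for the flash file system:
    {"trees": {name: {filename: bytes}}, "boot_index": name}.
    """
    store["trees"][new_tree] = dict(files)      # download lands beside the old tree
    if not verify(store["trees"][new_tree]):
        del store["trees"][new_tree]            # failed download: index is untouched
        return False
    store["boot_index"] = new_tree              # single top-level index update
    return True


store = {"trees": {"v1": {"app": b"old"}}, "boot_index": "v1"}
ok = commit_download(store, "v2", {"app": b"new"}, verify=lambda tree: "app" in tree)
print(ok, store["boot_index"])
```

Because the index is only changed after verification, an interrupted or corrupt download leaves the system booting from the existing directory tree.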

[0217] Fault detection and reporting 820, 822 is performed 
by having each CPU in the system gather as much 
information as possible about what it observed during a 
fault, and then comparing the information in order to detect 
which components could be the common cause of the 
symptoms. In some cases, it may take multiple faults before 
the algorithm can detect which component is at fault. 

[0218] Failures within the processor board 118 can be a 
single point failure. Specifically, everything on the board is 
a single point of failure except for the compute elements. 
This means that the only hard failures that can be configured 
out are failures in the compute elements 220. However, 
many failures are transient or soft, and these can be recov- 
ered from with a reboot cycle. 

[0219] In the case of hard failure of a compute element 
220, the application executes with reduced demand for 
computing resources. For example, the application may 
work with a smaller number of interference sources, or 
perform fewer interference cancellation iterations, but still 
within a tolerance. 

[0220] Failure of more than a single compute element will 
cause the board to be inoperative. Therefore, the application 
only needs to handle two configurations: all compute ele- 
ments functional and 1 compute element unavailable. Note 
that the single crossbar means that there are no issues as to 
which processes need to go on which processors — the 
bandwidth and latencies for any node to any other node are 
identical on the processor board, although other methods 
and techniques can be used. 



DSP Connected to Processing Board 
[0221] FIG. 9 shows an embodiment of the invention 
wherein a digital signal processor (DSP) 900 is connected 
with the processor board 118. Such a configuration enables a 
DSP to communicate via DMA with the processor board. One 
skilled in the art can appreciate that DMA transfers can be 
faster than bus transfers, and hence, throughput can be 
increased. Shown are a DSP processor, a buffer, an FPGA, 
and a crossbar. 

[0222] The DSP 900 generates a digital signal correspond- 
ing to an analog input, e.g., a rake receiver. The DSP 900 
operates in real-time, hence, the output is clocked to perform 
transfers of the digital output. In the illustrated embodiment, 
the DSP can be a Texas Instruments model TMS320C67XX 
series, however, other DSP processors are commercially 
available which can satisfy the methods and systems herein. 

[0223] A buffer 902 is coupled with the DSP 900, and 
receives and sends data in a First-In First-Out (e.g., queue) 
fashion, also referred to as a FIFO buffer. The buffer 902, in 
some embodiments, can be dual-ported RAM of sufficient 
size to capture data transfers. One skilled in the art can 
appreciate, however, that a protocol can be utilized to 
transfer the data where the buffer or dual-ported RAM is 
smaller than the data transfer size. 

[0224] An FPGA 904 is coupled with both the buffer 902 
and a crossbar 906 (which can be the same crossbar 
coupled with the compute elements 220 and host controller 
203). The FPGA 904 moves data from the buffer 902 to the 
crossbar 906, which subsequently communicates the data to 
further devices, e.g., a RACEway™ or the host controller 
203 or compute elements 220. The FPGA 904 can also perform 
data transfers directly from the DSP 900 to the crossbar 906. 
This method is utilized in some embodiments where data 
transfer sizes can be accommodated without buffering, for 
instance, although either the buffer or direct transfers can be 
used. 

[0225] The DSP 900 contains at least one external memory 
interface (EMIF) 908 device, which is connected to the 
buffer 902 or dual-ported RAM. RACEway™ transfers 
actually access the RAM, and then additional processing 
takes place within the DSP to move the data to the correct 
location in SDRAM within the DSP. In embodiments where 
the RAM is smaller than the data transfer size, there is 
a messaging protocol between two endpoint DSPs exchanging 
messages, since the message will be fragmented to be 
contained within the buffer or RAM. 

[0226] As more RACEway™ endpoints are added (for 
instance, to increase speed or throughput), the size of the 
dual-port RAM can be increased to 2*F*N*P bytes (that is, 
2*N*P buffers of size F), where F is the fragment size, N is 
the number of RACEway™ endpoints in communication with 
the DSP, and P is the number of parallel transfers which can 
be active on an endpoint. The constant 2 represents double 
buffering so one buffer can be transferred to the RACEway™ 
simultaneously with a buffer being transferred to the 
DSP. One skilled in the art can appreciate that the constant 
can be four rather than two to emulate a full-duplex 
connection. With a 4 node system, this could be, 
for example, 4*8K*4*4 or 512 Kbytes, plus an overhead 
factor for configuration and data tracking. 
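The sizing rule above can be checked with a short calculation. This is a sketch only; the function and parameter names are illustrative and the overhead factor mentioned in the text is not modeled.

```python
def dual_port_ram_bytes(fragment_bytes, endpoints, parallel_transfers, buffering=2):
    """Return buffering * F * N * P, the dual-port RAM size in bytes.

    buffering=2 models double buffering; buffering=4 models the
    full-duplex case mentioned in the text.
    """
    return buffering * fragment_bytes * endpoints * parallel_transfers


# Example from the text: 8K fragments, 4 endpoints, 4 parallel transfers, factor 4.
size = dual_port_ram_bytes(8 * 1024, endpoints=4, parallel_transfers=4, buffering=4)
print(size // 1024, "Kbytes")
```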

[0227] The FPGA 904 can program the DMA controller 
910 within the DSP 900 to move data between the buffer 902 
and the DSP/SDRAM 912 directly from a DSP host port 
914. The host port 914 is a peripheral like the EMIF 908, but 
can master transfers into the DSP data-paths, e.g., it can read 
and write any location within the DSP. Hence, the host port 
914 can access the DMA controller 910 and can be used to 
initiate transfers via the DMA engine. One skilled in the art 
can appreciate that using this architecture, RACEway™ 
transfers can be initiated without the cooperation of the DSP; 
thus, the DSP is free to continue processing while 
transfers take place and, further, there is no need for protocol 
messaging within the buffer. 

[0228] The FPGA 904 can also perform fragmentation of 
data. In embodiments where the buffer device is a dual-port 
RAM, the FPGA 904 can program the DMA controller within 
the DSP to move fragments into or out of the DSP. This 
method can be used to match throughput of the external 
transfer bus, e.g., the RACEway™. 

[0229] An example of the methods and systems described 
for a DSP is as follows. In an embodiment where the 
RACEway™ reads data out of the DSP memory 912, this 
example assumes that another DSP is reading the SDRAM 
of the local DSP. The FPGA 904 detects a RACEway™ data 
packet arriving, and decodes the packet to determine that it 
contains instructions for a data-read at, for example, 
memory location 0x10000. The FPGA 904 writes over the 
host port interface 914 to program the DMA controller 910 
to transfer data starting at memory location 0x10000, which 
refers to a location in the primary EMIF 908 corresponding 
to a location in the SDRAM 912, and to move that data to 
a location in the secondary EMIF (e.g., the buffer device) 
902. As data arrives in the buffer 902, the FPGA 904 reads 
the data out of the buffer, and moves it onto the RACEway™ 
bus. When a predetermined block of data is moved, the 
DMA controller 910 finishes the transfer, and the FPGA 904 
finishes moving the data from the buffer 902 to the 
RACEway™. 

[0230] Another example assumes that another DSP is 
requesting a write to the local DSP. Here, the 
FPGA 904 detects a data packet arriving, and determines 
that it is a write to location 0x20000, for instance. The FPGA 
904 fills some amount of the buffer 902 with the data from 
the RACEway™ bus, and then writes over the host port 
interface 914 to program the DMA controller 910. The DMA 
controller 910 then transfers data from the buffer device 902 
and writes that data to the primary EMIF 908 at address 
0x20000. At the conclusion of the transfer, an interrupt can 
be sent to the DSP 900 to indicate that a data packet has 
arrived, or a polling of a location in the SDRAM 912 can 
accomplish the same requirement. 
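The write example can be modeled in software as a staged copy through a buffer that is smaller than the transfer. The sketch below is purely illustrative: `fragmented_write`, the `bytearray` standing in for SDRAM 912, and the 4096-byte buffer size are assumptions, and real hardware would program the DMA controller over the host port rather than copy bytes in a loop.

```python
def fragmented_write(sdram, address, payload, buffer_size):
    """Move `payload` into `sdram` at `address` through a staging buffer.

    Each loop pass models one fill of the buffer by the FPGA followed by
    one DMA transfer from the buffer into SDRAM.
    Returns the number of buffer-sized transfers that were needed.
    """
    transfers = 0
    for offset in range(0, len(payload), buffer_size):
        staging = payload[offset:offset + buffer_size]    # FPGA fills the buffer
        start = address + offset
        sdram[start:start + len(staging)] = staging        # DMA: buffer -> SDRAM
        transfers += 1
    return transfers


sdram = bytearray(0x30000)
packet = bytes(range(256)) * 40                            # 10240-byte write request
count = fragmented_write(sdram, 0x20000, packet, buffer_size=4096)
print(count, sdram[0x20000:0x20004])
```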

[0231] These two examples are non-limiting examples, and 
other embodiments can utilize different methods and devices 
for the transfer of data between devices. For example, if the 
DSP 900 utilizes RapidIO interfaces, the buffer 902 and 
FPGA 904 can be modified to accommodate this protocol. 
Also, the crossbar 906 illustrated may be in common with a 
separate bus structure, or be in common with the processor 
board 118 described above. Even further, in some embodi- 
ments, the FPGA 904 can be directly coupled with the board 
processor, or be configured as a compute node 220. 

[0232] Therefore, as can be understood by one skilled in 
the art, the methods and systems herein are suited for 
multiple user detection within base stations, and can be used 
to accommodate both short-code and long-code receivers. 



Short-Code Processing 

[0233] In one embodiment of the invention using short- 
code receivers, a possible mapping of matrices necessary for 
short-code mapping is now discussed. In order to perform 
MUD at the symbol rate, the correlation between the user 
channel-corrupted signature waveforms must be calculated. 
These correlations are stored as elements in matrices, here 
referred to as R-matrices. Because the channel is continually 
changing, the correlations need be updated in real-time. 

[0234] The implementation of MUD at the symbol rate 
can be divided into two functions. The first function is the 
calculation of the R-matrix elements. The second function is 
interference cancellation, which relies on knowledge of the 
R-matrix elements. The calculation of these elements and 
the computational complexity are described in the following 
section. Computational complexity is expressed in Giga- 
Operations Per Second (GOPS). The subsequent section 
describes the MUD IC function. The method of interference 
cancellation employed is Multistage Decision Feedback IC 
(MDFIC). 

[0235] The R-matrix calculations can be divided into three 
separate calculations, each with an associated time constant 
for real-time operation, as follows: 

[0236] where the hats that would otherwise indicate 
parameter estimates are omitted. Hence we must calculate 
the R-matrices, which depend on the C-matrices, which in 
turn depend on the Γ-matrix. The Γ-matrix has the slowest 
time constant. This matrix represents the user code 
correlations for all values of offset m. For a case of 100 voice 
users the total memory requirement is 21 MBytes based on 
two bytes (real and imaginary parts) per element. This 
matrix is updated only when new codes (e.g., new users) are 
added to the system. Hence this is essentially a static matrix. 
The computational requirements are negligible. 

[0237] The most efficient method of calculation depends 
on the non-zero length of the codes. For high data-rate users 
the non-zero length of the codes is only 4 chips. For 
these codes, a direct convolution is the most efficient method 
to calculate the elements. For low data-rate users it is more 
efficient to calculate the elements using the FFT to perform 
the convolutions in the frequency domain. Further, as 
can be appreciated by one skilled in the art, cache memory 
can be used where the matrix is somewhat static compared 
with the update of other matrices. 
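The two calculation methods can be compared with a small sketch. It assumes the code correlation has the form Γ_lk[m] = (1/(2N_l)) Σ_n c_l*[n] c_k[n-m], with each code indexed from zero over its non-zero length; the function names and the simple recursive radix-2 FFT are illustrative and are not the implementation described here.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def gamma_direct(cl, ck):
    """Direct sum over the overlap, one lag m at a time."""
    nl, nk = len(cl), len(ck)
    out = {}
    for m in range(-(nk - 1), nl):
        s = sum(cl[n].conjugate() * ck[n - m] for n in range(nl) if 0 <= n - m < nk)
        out[m] = s / (2 * nl)
    return out


def gamma_fft(cl, ck):
    """Same correlation for all lags at once via zero-padded FFTs."""
    nl, nk = len(cl), len(ck)
    size = 1
    while size < nl + nk - 1:
        size *= 2
    a = fft(list(cl) + [0j] * (size - nl))
    b = fft(list(ck) + [0j] * (size - nk))
    y = fft([p * q.conjugate() for p, q in zip(a, b)], inverse=True)
    return {m: (y[m % size] / size).conjugate() / (2 * nl) for m in range(-(nk - 1), nl)}


short_code = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
long_code = [1 - 1j, -1 - 1j, 1 + 1j, 1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, 1 + 1j]
print(max(abs(gamma_direct(short_code, long_code)[m] - gamma_fft(short_code, long_code)[m])
          for m in range(-7, 4)))
```

For a 4-chip code the direct sum touches only a handful of terms per lag, while the FFT route pays off as the non-zero code length grows.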

[0238] The C-matrix is calculated from the Γ-matrix. 
These elements must be calculated whenever a user's delay 
lag changes. For now, assume that on average each multipath 
component changes every 400 ms. The length of the g[] 
function is 48 samples. Since we are oversampling by 4, 
there are 12 multiply-accumulations (real x complex) to be 
performed per element, or 48 operations per element. When 
there are 100 low-rate users on the system (i.e., 200 virtual 
users) and a single multipath lag (of 4) changes for one user, 
a total of (1.5)(2)K_v L N_v elements must be calculated. The 
factor of 1.5 comes from the 3 C-matrices (m' = -1, 0, 1), 
reduced by a factor of 2 due to a conjugate symmetry 
condition. The factor of 2 results because both rows and 
columns must be updated. The factor N_v is the number of 
virtual users per physical user, which for the lowest rate 
users is N_v = 2. In total then this amounts to 230,400 
operations per multipath component per physical user. Assuming 
100 physical users with 4 multipath components per user, 
each changing once per 400 ms, gives 230 MOPS. 
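The operation count in this paragraph can be reproduced step by step. The sketch below uses the figures from the text; the function name and the reading of 48 operations as 12 real-by-complex multiply-accumulates at 4 operations each are assumptions made for illustration.

```python
def c_matrix_update_load(physical_users=100, paths=4, nv=2, lg=48, nc=4,
                         change_period_s=0.4):
    """Return (operations per lag change, sustained MOPS) for C-matrix updates."""
    kv = physical_users * nv                     # virtual users
    macs_per_element = lg // nc                  # 48 samples, oversampled by 4 -> 12
    ops_per_element = 4 * macs_per_element       # real x complex MAC counted as 4 ops
    elements = 1.5 * 2 * kv * paths * nv         # (1.5)(2) K_v L N_v
    ops_per_change = elements * ops_per_element
    changes_per_second = physical_users * paths / change_period_s
    return ops_per_change, ops_per_change * changes_per_second / 1e6


ops, mops = c_matrix_update_load()
print(ops, round(mops, 1))
```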

[0239] The R-matrices are calculated from the C-matrices. 
From the equation above the R-matrix elements are 

$$r_{lk}[m'] = \sum_{q=1}^{L}\sum_{q'=1}^{L} \mathrm{Re}\{a_{lq}^{*}\, C_{lkqq'}[m']\, a_{kq'}\} = \mathrm{Re}\{a_l^{H}\, C_{lk}[m']\, a_k\}$$

[0240] where a_k are L x 1 vectors, and C_lk[m'] are L x L 
matrices. The rate at which these calculations must be 
performed depends on the velocity of the users. The selected 
update interval is 1.33 ms. If the update rate is too slow, such 
that the estimated R-matrix values deviate significantly from 
the actual R-matrix values, then there is a degradation in the 
MUD efficiency. 

[0241] From the above equation, the R-matrix elements can 
be calculated in terms of an X-matrix which represents 
amplitude-amplitude multiplies: 

[0242] The X-matrix multiplies can be reused for all 
virtual users associated with a physical user and for all m' 
(i.e., m' = 0, 1). Hence these calculations are negligible when 
amortized. The remaining calculations can be expressed as 
a single real dot product of length 2L^2 = 32. The calculations 
are performed in 16-bit fixed-point math. The total number 
of operations is thus 1.5(4)(K_v L)^2 = 3.84 Mops. The 
processing requirement is then 2.90 GOPS. The X-matrix 
multiplies when amortized amount to an additional 0.7 GOPS. 
The total processing requirement is then 3.60 GOPS. 
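A short calculation shows how these figures follow from one another, assuming the 3.84 Mops are spent once per 1.33 ms update. The function and parameter names are illustrative only.

```python
def r_matrix_load(kv=200, paths=4, update_period_s=1.33e-3, x_matrix_gops=0.7):
    """Return (ops per update, dot-product GOPS, total GOPS) for the R-matrix."""
    ops_per_update = 1.5 * 4 * (kv * paths) ** 2          # 1.5(4)(K_v L)^2
    dot_product_gops = ops_per_update / update_period_s / 1e9
    return ops_per_update, dot_product_gops, dot_product_gops + x_matrix_gops


ops, gops, total = r_matrix_load()
print(int(ops), round(gops, 2), round(total, 2))
```

With these inputs the dot products come to roughly 2.9 GOPS, and adding the amortized X-matrix cost gives roughly 3.6 GOPS.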



[0243] From the equation above the matched-filter outputs 
are given by: 

$$y_l[m] = r_{ll}[0]\,b_l[m] + \sum_{k=1}^{K_v} r_{lk}[-1]\,b_k[m+1] + \sum_{k=1}^{K_v}\big[r_{lk}[0] - r_{ll}[0]\,\delta_{lk}\big]\,b_k[m] + \sum_{k=1}^{K_v} r_{lk}[1]\,b_k[m-1] + \eta_l[m]$$


[0244] The first term represents the signal of interest. All 
the remaining terms represent Multiple Access Interference 
(MAI) and noise. The multiple-stage decision-feedback 
interference cancellation (MDFIC) algorithm iteratively 
solves for the symbol estimates using 

$$\hat b_l[m] = \mathrm{sign}\Big\{ y_l[m] - \sum_{k=1}^{K_v} r_{lk}[-1]\,\hat b_k[m+1] - \sum_{k=1}^{K_v}\big[r_{lk}[0] - r_{ll}[0]\,\delta_{lk}\big]\,\hat b_k[m] - \sum_{k=1}^{K_v} r_{lk}[1]\,\hat b_k[m-1] \Big\}$$



[0245] with initial estimates given by hard decisions on 
the matched-filter detection statistics, b_l[m] = sign{y_l[m]}. 
The MDFIC technique is closely related to the SIC and PIC 
techniques. Notice that new estimates are immediately 
introduced back into the interference cancellation as they are 
calculated. Hence at any given cancellation step the best 
available symbol estimates are used. This idea is analogous 
to the Gauss-Seidel method for solving diagonally dominant 
linear systems. 
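The iteration can be sketched for real-valued quantities as follows. This is a simplified illustration, not the implementation described here: the function names are hypothetical, symbols outside the block are treated as zero, and the R-matrices are plain nested lists.

```python
def sign(value):
    return 1 if value >= 0 else -1


def mdfic(y, r_prev, r_zero, r_next, iterations=2):
    """Multistage decision-feedback IC on a block of matched-filter outputs.

    y[l][m] is the output for user l, symbol m; r_prev, r_zero, r_next
    hold R[-1], R[0] and R[1] as K x K nested lists.
    """
    users, block = len(y), len(y[0])
    b = [[sign(y[l][m]) for m in range(block)] for l in range(users)]  # hard decisions
    for _ in range(iterations):
        for m in range(block):
            for l in range(users):
                est = y[l][m]
                for k in range(users):
                    if m + 1 < block:
                        est -= r_prev[l][k] * b[k][m + 1]
                    if k != l:                         # [r_lk[0] - r_ll[0] delta_lk]
                        est -= r_zero[l][k] * b[k][m]
                    if m > 0:
                        est -= r_next[l][k] * b[k][m - 1]
                b[l][m] = sign(est)                    # reused immediately
    return b


# Two users, user 2 much stronger; no inter-symbol terms for simplicity.
zero = [[0, 0], [0, 0]]
r0 = [[1.0, 1.5], [1.5, 4.0]]
y = [[2.5, 0.5, -0.5, -2.5], [5.5, 2.5, -2.5, -5.5]]
print(mdfic(y, zero, r0, zero))
```

In this toy case the weak user's hard decisions are wrong wherever the strong user has the opposite sign, and one cancellation pass corrects them.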

[0246] The above iteration is performed on a block of 20 
symbols, for all users. The 20-symbol block size represents 
two WCDMA time slots. The R-matrices are assumed to be 
constant over this period. Performance is improved under 
high input BER if the sign detector is replaced by the 
hyperbolic tangent detector. This detector has a single slope 
parameter which is variable from iteration to iteration. 
Similarly, performance is improved if only a fraction of the 
total estimated interference is cancelled (e.g., partial inter- 
ference cancellation), owing to channel and symbol estima- 
tion errors. 

Multiple Processors Generating Complementary 
R-Matrices 

[0247] The three R-matrices (R[-1], R[0] and R[1]) are 
each K_v x K_v in size. The total number of operations is then 
6K_v^2 per iteration. The computational complexity of the 
MDFIC algorithm depends on the total number of 
virtual users, which depends on the mix of users at the 
various spreading factors. For K_v = 200 users (e.g., 100 
low-rate users) this amounts to 240,000 operations. In the 
current implementation two iterations are used, requiring a 
total of 480,000 operations. For real-time operation these 
operations must be performed in 1/15 ms. The total processing 
requirement is then 7.2 GOPS. Computational complexity is 
markedly reduced if a threshold parameter is set such that IC 
is performed only for values |y_l[m]| below the threshold. The 
idea is that if |y_l[m]| is large there is little doubt as to the sign 
of b_l[m], and IC need not be performed. The value of the 
threshold parameter is variable from stage to stage. 
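The load figure and the effect of the threshold can be illustrated together. The sketch assumes 6K_v^2 operations per iteration (an assumption chosen because it reproduces the 240,000 figure for K_v = 200) and assumes that the load scales with the fraction of matched-filter outputs falling below the threshold; both function names are hypothetical.

```python
def mdfic_gops(kv=200, iterations=2, symbol_period_s=1 / 15000, active_fraction=1.0):
    """Processing load when IC runs for `active_fraction` of the symbols."""
    ops_per_iteration = 6 * kv ** 2                  # assumed count per iteration
    return ops_per_iteration * iterations * active_fraction / symbol_period_s / 1e9


def fraction_below_threshold(y_block, threshold):
    """Share of matched-filter outputs whose magnitude is below the threshold."""
    magnitudes = [abs(v) for row in y_block for v in row]
    return sum(v < threshold for v in magnitudes) / len(magnitudes)


y_block = [[0.1, 2.0], [-3.0, 0.4]]
active = fraction_below_threshold(y_block, threshold=1.0)
print(round(mdfic_gops(), 2), round(mdfic_gops(active_fraction=active), 2))
```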

[0248] Although three R matrices are output from the R 
matrix calculation function, only half of the elements are 
explicitly calculated. This is because of symmetry that exists 
between R matrices: 

[0249] Therefore, only two matrices need to be calculated. 
The first one is a combination of R(1) and R(-1). The second 
is the R(0) matrix. In this case, the essential R(0) matrix 
elements have a triangular structure to them. The number of 
computations performed to generate the raw data for the 
R(1)/R(-1) and R(0) matrices are combined and optimized 
as a single number. This is due to the reuse of the X-matrix 
outer product values across the two R-matrices. Since the 
bulk of the computations involve combining the X-matrix 
and correlation values, they dominate the processor 
utilization. These computations are used as a cost metric in 
determining the optimum loading of each processor. 

Processor Loading Optimization 

[0250] The optimization problem is formulated as an equal 
area problem, where the solution results in each partition 
area to be equal. Since the major dimensions of the R-ma- 
trices are in terms of the number of active virtual users, the 
solution space for this problem is in terms of the number of 
virtual users per processor. By normalizing the solution 
space by the number of virtual users, the solution is appli- 
cable for an arbitrary number of virtual users. 

[0251] FIG. 10 shows a model of the normalized 
optimization scenario. The computations for the R(1)/R(-1) 
matrix are represented by the square HJKM, while the 
computations for the R(0) matrix are represented by the 
triangle ABC. From geometry, the area of a rectangle of 
length b and height h is: 

$$A = bh$$

[0252] For a triangle with a base width b and height h, the 
area is calculated by: 

$$A = \tfrac{1}{2}\,bh$$

[0253] When combined with a common height a_i, the 
formula for the area becomes: 

$$A_i = A_{r_i} + A_{t_i} = a_i + \tfrac{1}{2}\,a_i^2$$

[0254] The formula for A_i gives the area for the total region 
below the partition line. For example, the formula for A_2 
gives the area within the rectangle HQRM plus the region 
within triangle AFG. For the cost function, the difference in 
successive areas is used. That is: 

$$B_i = A_i - A_{i-1}$$

[0255] For an optimum solution, the B_i must be equal for 
i = 1, 2, ..., N, where N is the number of processors 
performing the calculations. Because the total normalized 
load is equal to A_N, the load per processor is equal 
to A_N/N. 

[0256] By combining the two equations for B_i, the solution 
for a_i is found by finding the roots of the equation: 

$$\tfrac{1}{2}\,a_i^2 + a_i - \tfrac{1}{2}\,a_{i-1}^2 - a_{i-1} - \tfrac{3}{2N} = 0$$

[0257] The solution for a_i is: 

$$a_i = -1 \pm \sqrt{1 + a_{i-1}^2 + 2a_{i-1} + \tfrac{3}{N}}, \quad \text{for } i = 1, 2, \ldots, N$$

[0258] Since the solution space must fall in the range 
[0,1], negative roots are not valid solutions to the problem. 
On the surface, it appears that the a_i must be solved 
recursively, starting with the case i = 1. However, by 
expanding the recursions of the a_i and using the fact that a_0 
equals zero, a solution that does not require the previous 
a_i, i = 0, 1, ..., n-1, exists. The solution is: 

$$a_i = -1 + \sqrt{1 + \tfrac{3i}{N}}$$


[0259] The following table shows the normalized 
partition values for two, three, and four processors. To 
calculate the actual partitioning values, the number of active 
virtual users is multiplied by the corresponding table entries. 
Since a fraction of a user cannot be allocated, a ceiling 
operation is performed that biases the number of virtual 
users per processor towards the processors whose loading 
function is less sensitive to perturbations in the number of 
users. 

Location   Two Processors          Three Processors      Four Processors
a_1        -1 + √(5/2) (0.5811)    -1 + √2 (0.4142)      -1 + √(7/4) (0.3229)
a_2                                -1 + √3 (0.7321)      -1 + √(5/2) (0.5811)
a_3                                                      -1 + √(13/4) (0.8028)

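The partition values can be generated for any number of processors from the closed form a_i = -1 + sqrt(1 + 3i/N), which reproduces the tabulated entries. The sketch below is illustrative; the function names are hypothetical and the ceiling step follows the description in the text.

```python
import math


def partition_points(n_processors):
    """Normalized partition values a_1 .. a_(N-1) for an equal-area split."""
    return [-1.0 + math.sqrt(1.0 + 3.0 * i / n_processors)
            for i in range(1, n_processors)]


def user_boundaries(n_virtual_users, n_processors):
    """Row boundaries per processor: scale by the user count and apply a ceiling."""
    cuts = [math.ceil(a * n_virtual_users) for a in partition_points(n_processors)]
    return [0] + cuts + [n_virtual_users]


def area(a):
    """Normalized load below partition line a: rectangle plus triangle."""
    return a + 0.5 * a * a


print([round(a, 4) for a in partition_points(4)])
print(user_boundaries(200, 4))
```

The early partitions contain more rows than the later ones because each row of the triangular R(0) region below the first cut carries less work.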



[0260] One skilled in the art can appreciate that the load 
balancing for the R-matrix results in a non-uniform parti- 
tioning of the rows of the final matrices over a number of 
processors. The partition sizes increase as the partition 
starting user index increases. When the system is running at 
full capacity (e.g., all co-processors are functional, and the 
maximum number of users is processed while still within the 
bounds of real-time operation), and a co-processor fails, the 
impact can be significant. 

[0261] This impact can be minimized by allocating the 
first user partition to the disabled node. Also the values that 
would have been calculated by that node are set to zero. This 
reduces the effects of the failed node. By changing which 
user data is set to zero (e.g., which users are assigned to the 
failed node) the overall errors due to the lack of non-zero 
output data for that node are averaged over all of the users, 
providing a "soft" degradation. 

R, C Values Contiguous in MPIC Processor 
Memory 

[0262] Further, via connection with the crossbar multi-port 
connector, the multi-processor elements calculating the 
R-matrix (which depends on the C-matrix, which in turn 
depends on the gamma-matrix) can place the results in a 
processor element performing the MPIC functions. For one 
optimal solution, the values can be placed in contiguous 
locations accessible to (or local to) the MPIC processor. 
This method allows adjacent memory addresses for the R 
and C values, and increases throughput by simply 
incrementing memory pointers rather than using a random 
access approach. 

[0263] As discussed above, the values of the Γ-matrix 
elements which are non-zero need to be determined for 
efficient storage of the Γ-matrix. For high data rate users, 
certain elements c_l[n] are zero, even within the interval 
n = 0:N-1, N = 256. These zero values reduce the interval over 
which Γ_lk[m] is non-zero. In order to determine the interval 
for non-zero values consider the following relations: 



[0264] The index j_l for the lth virtual user is defined such 
that c_l[n] is non-zero only over the interval 
n = j_l N_l : j_l N_l + N_l - 1. Correspondingly, the vector 
c_k[n] is non-zero only over the interval 
n = j_k N_k : j_k N_k + N_k - 1. Given these definitions, 
Γ_lk[m] can be rewritten as 

$$\Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n=j_l N_l}^{j_l N_l + N_l - 1} c_l^{*}[n]\, c_k[n-m]$$

[0265] The minimum value of m for which Γ_lk[m] is 
non-zero is 

$$m_{\min 2} = j_l N_l - j_k N_k - N_k + 1$$

[0266] and the maximum value of m for which Γ_lk[m] is 
non-zero is 

$$m_{\max 2} = j_l N_l - j_k N_k + N_l - 1$$

[0267] The total number of non-zero elements is then 

$$m_{total} = m_{\max 2} - m_{\min 2} + 1 = N_l + N_k - 1$$

[0268] The table below provides a sample of the 
number of bytes per l,k virtual-user pair based on 2 bytes per 
element: one byte for the real part and one byte for the 
imaginary part. In other embodiments, these values vary. 

            N_k = 256    128     64     32     16      8      4
N_l = 256        1022    766    638    574    542    526    518
      128         766    510    382    318    286    270    262
       64         638    382    254    190    158    142    134
       32         574    318    190    126     94     78     70
       16         542    286    158     94     62     46     38
        8         526    270    142     78     46     30     22
        4         518    262    134     70     38     22     14
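The table entries follow from the support of the two codes. The sketch below assumes the correlation pairs c_l[n] with c_k[n-m], so that a lag m is non-zero only when the two supports overlap; the function names are illustrative.

```python
def lag_range(j_l, n_l, j_k, n_k):
    """Smallest and largest lag m for which the two code supports overlap.

    Code l is non-zero for n in [j_l*n_l, j_l*n_l + n_l - 1] and code k for
    n - m in [j_k*n_k, j_k*n_k + n_k - 1] (assumed convention).
    """
    m_min = j_l * n_l - (j_k * n_k + n_k - 1)
    m_max = (j_l * n_l + n_l - 1) - j_k * n_k
    return m_min, m_max


def bytes_per_pair(n_l, n_k, bytes_per_element=2):
    """Storage for one (l, k) pair: one real byte and one imaginary byte per lag."""
    m_min, m_max = lag_range(0, n_l, 0, n_k)
    return bytes_per_element * (m_max - m_min + 1)


factors = [256, 128, 64, 32, 16, 8, 4]
for n_l in factors:
    print(n_l, [bytes_per_pair(n_l, n_k) for n_k in factors])
```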



[0269] The memory requirements for storing the Γ-matrix 
for a given number of users at each spreading factor can be 
determined as described below. For example, for K_q virtual 
users at spreading factor N_q = 2^(8-q), q = 0:6, where K_q is 
the qth element of the vector K (some elements of K may be 
zero), the storage requirement can be computed as follows. 
Let the table above be stored in matrix M with elements 
M_qq'. For example, M_00 = 1022, and M_01 = 766. The total 
memory required by the Γ-matrix in bytes is then given by 
the following relation 

$$M_{bytes} = \sum_{q=0}^{6}\Big(\tfrac{1}{2}\,K_q (K_q + 1)\, M_{qq} + \sum_{q'=q+1}^{6} K_q K_{q'} M_{qq'}\Big)$$



[0270] Then, continuing the example, for 200 virtual users 
at spreading factor N_0 = 256, K_q = 200 δ_q0, which in turn 
results in M_bytes = (1/2)K_0(K_0+1)M_00 = 100(201)(1022) = 20.5 
MB. For 10 384 Kbps users, K_q = K_0 δ_q0 + K_6 δ_q6 with 
K_0 = 10 and K_6 = 640, which results in a storage requirement 
that is given by the following relations: 

$$M_{bytes} = \tfrac{1}{2} K_0 (K_0 + 1) M_{00} + K_0 K_6 M_{06} + \tfrac{1}{2} K_6 (K_6 + 1) M_{66} = 5(11)(1022) + 10(640)(518) + 320(641)(14) = 6.2\ \text{MB}$$
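Both storage examples can be reproduced with a short calculation. The sketch assumes that each table entry is 2(N_q + N_q' - 1) bytes and that pairs are counted once with k >= l; the names are illustrative.

```python
SPREADING_FACTORS = [2 ** (8 - q) for q in range(7)]    # N_q for q = 0..6


def gamma_storage_bytes(users_per_factor):
    """Total bytes for the code-correlation matrix.

    users_per_factor[q] is K_q, the number of virtual users at N_q.
    """
    total = 0
    for q, n_q in enumerate(SPREADING_FACTORS):
        k_q = users_per_factor[q]
        m_qq = 2 * (2 * n_q - 1)
        total += k_q * (k_q + 1) // 2 * m_qq             # pairs within one factor
        for q2 in range(q + 1, len(SPREADING_FACTORS)):
            m_qq2 = 2 * (n_q + SPREADING_FACTORS[q2] - 1)
            total += k_q * users_per_factor[q2] * m_qq2  # pairs across two factors
    return total


print(gamma_storage_bytes([200, 0, 0, 0, 0, 0, 0]))      # 200 virtual users at SF 256
print(gamma_storage_bytes([10, 0, 0, 0, 0, 0, 640]))     # ten 384 Kbps users
```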



[0271] The Γ-matrix data can be addressed, stored, and 
accessed as described below. In particular, for each pair (l,k), 
k>=l, there is one complex Γ_lk[m] value for each value of m, 
where m ranges from m_min2 to m_max2, and the total number 
of non-zero elements is m_total = m_max2 - m_min2 + 1. Hence, 
for each pair (l,k), k>=l, there exist 2 m_total time-contiguous 
bytes. 

[0272] In one embodiment, an array structure is created to 
access the data, as shown below: 



struct { 
    int m_min2; 
    int m_max2; 
    int m_total; 
    char *Glk; 
} G_info[N_VU_MAX][N_VU_MAX]; 

[0273] The C-matrix data can then be retrieved by utilizing 
the following exemplary algorithm: 

m_min2 = G_info[l][k].m_min2 
m_max2 = G_info[l][k].m_max2 
for m' = 0:1 
  N1 = m'N - L_g/(2N_c) 
  for q = 0:L-1 
    for q' = 0:L-1 
      m_min1 = N1 - τ_lq + τ_kq' 
      m_max1 = m_min1 + N_g 
      m_min = max(m_min1, m_min2) 
      m_max = min(m_max1, m_max2) 
      if m_max >= m_min 
        m_span = m_max - m_min + 1 
        sum1 = 0.0; 
        ptr1 = &G_info[l][k].Glk[m_min] 
        ptr2 = &g[m_min * N_c + τ] 
        while m_span > 0 
          sum1 += (*ptr1++) * (*ptr2++) 
          m_span = m_span - 1 
        end 
        C[m'][l][k][q][q'] = sum1 
      end 
    end 
  end 
end 

[0274] A direct method for calculating the C-matrix (in 
symmetry) is performance of the following equation: 

C^ q [m'}= —C kiK ,[m'\ 

[0275] Due to symmetry, there are 1.5(K_v L)^2 elements to 
calculate. Assuming all users are at SF 256, each calculation 
requires 256 cmacs, or 2048 operations. The probability that 
a multipath changes in a 10 ms time period is approximately 
10/200 = 0.05 if all users are at 120 kmph. Assuming a mix 
of user velocities, a reasonable probability is 0.025. Because 
the C-matrix represents the interaction between two users, 
the probability that C-matrix elements change in a 10 ms 
time period is approximately 0.10 for all users at 120 kmph, 
or 0.05 for a mix of user velocities. Hence, the GOPS are 
shown in the following table. 

K_v   High velocity users   1.5(K_v L)^2   Gops    Percentage change   GOPS
200   100%                  960,000        1.966   20                  39.3
200   50%                   960,000        1.966   15                  29.5
128   100%                  393,216        0.805   20                  16.1
128   50%                   393,216        0.805   15                  12.1

[0276] One skilled in the art can appreciate that a fast 
Fourier transform (FFT) can be used to calculate the 
correlations for a range of offsets, tau, using: 

$$C_{lkqq'}[m'] = \frac{1}{2N_l}\sum_n s_k\big[nN_c + \tau_{lkqq'}[m']\big]\, c_l^{*}[n] = C_{lk}\big[\tau_{lkqq'}[m']\big]$$



[0277] The length of the waveform s_k[t] is L_g + 255N_c = 
1068 for L_g = 48 and N_c = 4. This is represented as N_c 
waveforms of length L_g/N_c + 255 = 267. One advantage of this 
approach is that elements can be stored for a range of offsets 
tau so that calculations do not need to be performed when 
lags change. For delay spreads of about 4 microseconds, 32 
samples need to be stored for each m'. 

[0278] The C-matrix elements need to be updated when the 
spreading factor changes. The spreading factor can change 
due to AMR codec rate changes, multiplexing of the 
dedicated channels, or multiplexing of data services, to name a 
few reasons. It is reasonable to assume that 5% of the users, 
hence 10% of the elements, change every 10 ms. 

Gamma-Matrix Generated in FPGA 

[0279] The C-matrix elements can be represented in terms 
of the underlying code correlations using: 

$$C_{lkqq'}[m'] = \sum_{m} g[mN_c + \tau]\,\Gamma_{lk}[m]$$

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l}\sum_{n} c_l^{*}[n]\cdot c_k[n-m]$$



[0280] If the length of g[t] is L_g = 48 and N_c = 4, then the 
summation over m requires 48/4 = 12 macs for the real part 
and 12 macs for the imaginary part. The total is then 48 
ops per element. (Compare with 2048 operations for the 
direct method.) Hence for the case where there are 200 
virtual users and 20% of the C-matrix needs updating every 
10 ms the required complexity is (960000 el)(48 ops/ 
el)(0.20)/(0.010 sec) = 921.6 MOPS. This is the required 
complexity to compute the C-matrix from the Tau-matrix. 
The cost of computing the Tau-matrix must also be 
considered. The Tau-matrix can be efficiently computed since 
the fundamental operation is a convolution of codes with 
elements constrained to be ±1±j. Further, the Tau-matrix can 
be calculated using modulo-2 addition (e.g., XOR) using 
several methods, e.g., register shifting, XOR logic gates, and 
so on. 
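The comparison between the two approaches can be made explicit with a short calculation using the figures quoted in the text. The function name and default arguments are illustrative.

```python
def c_matrix_mops(ops_per_element, kv=200, paths=4, fraction=0.20, period_s=0.010):
    """MOPS needed to refresh `fraction` of the 1.5(K_v L)^2 elements each period."""
    elements = 1.5 * (kv * paths) ** 2
    return elements * ops_per_element * fraction / period_s / 1e6


print(round(c_matrix_mops(48), 1))      # from stored code correlations
print(round(c_matrix_mops(2048), 1))    # direct method
```

The ratio of the two results is simply 2048/48, so precomputing the code correlations reduces the update load by roughly a factor of 43 before the cost of computing the correlations themselves is counted.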

[0281] The Gamma matrix (Γ) represents the correlation 
between the complex user codes. 

[0282] The complex code for user l is assumed to be 
infinite in length, but with only N_l non-zero values. The 
non-zero values are constrained to be ±1±j. The Γ-matrix 
can be represented in terms of the real and imaginary parts 
of the complex user codes, and is based on the relationship: 

$$\Gamma^{XY}_{lk}[m] = \frac{1}{2N_l}\big\{M^{XY}_{lk}[m] - N^{XY}_{lk}[m]\big\}$$

$$M^{XY}_{lk}[m] \equiv \sum_{n} m^{X}_{l}[n] \oplus m^{Y}_{k}[n-m]$$

n 



[0283] which can be performed using a dual-set of shift 
registers and a logical circuit containing modulo-2 (e.g., 
Exclusive-OR "XOR" ) logic elements. Further, one skilled 
in the art can appreciate that such a logic device can be 
implemented in a field programmable gate array, which can 
be programmed via the host controller, a compute element, 
or other device including an application specific integrated 
circuit. Further, the FPGA can be programmed via the 
RACE-way™ bus, for example. 
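
As a rough illustration of why shift registers and XOR gates suffice, the sketch below correlates two real-valued ±1 code components both directly and with XOR on their bit representations. It is a minimal sketch under stated assumptions: the mapping of +1 to bit 0 and −1 to bit 1, the helper names, and the "overlap minus twice the mismatches" identity are illustrative and are not claimed to be the exact M and N functions of the hardware.

```python
# Correlating +/-1 codes with XOR instead of multiplies (illustrative sketch).
# Assumption: +1 is stored as bit 0 and -1 as bit 1; all names are hypothetical.

def to_bits(code):
    """Map a +/-1 sequence to its 0/1 bit representation."""
    return [0 if c > 0 else 1 for c in code]


def corr_direct(a, b, m):
    """Direct correlation: sum over n of a[n] * b[n - m] where both exist."""
    return sum(a[n] * b[n - m] for n in range(len(a)) if 0 <= n - m < len(b))


def corr_xor(a_bits, b_bits, m):
    """Same correlation using only counting and XOR."""
    overlap = 0      # number of positions where both codes are non-zero
    mismatches = 0   # number of positions where the bits differ
    for n in range(len(a_bits)):
        k = n - m
        if 0 <= k < len(b_bits):
            overlap += 1
            mismatches += a_bits[n] ^ b_bits[k]
    # each match contributes +1 and each mismatch contributes -1
    return overlap - 2 * mismatches


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [-1, -1, 1, 1, -1, 1, -1, 1]
for shift in range(-7, 8):
    assert corr_direct(a, b, shift) == corr_xor(to_bits(a), to_bits(b), shift)
```

Applying the same routine to the four real/imaginary component pairs would give the four partial correlations that the text says can be computed in parallel.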



[0284] The above shift registers together with a summation 
device calculate the functions M_lk^XY[m] and N_lk^XY[m]. 
The remaining calculations to form Γ_lk^XY[m] and 
subsequently Γ_lk[m] can be performed in software. Note that 
the four functions Γ_lk^XY[m] corresponding to X, Y = R, I, 
which are components of Γ_lk[m], can be calculated in parallel. For 
K_v = 200 virtual users, and assuming that 10% of all (l,k) 
pairs must be calculated in 2 ms, then for real-time operation 
we must calculate 0.10(200)^2 = 4000 elements (all shifts) in 
2 ms, or about 2M elements (all shifts) per second. For 
K_v = 128 virtual users the requirement drops to 0.8192M 
elements (all shifts) per second. 
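
The element rates quoted above follow from a one-line formula. The Python sketch below is only a worked check; the function name is hypothetical.

```python
# Worked check of the Gamma-matrix element rate (function name is hypothetical).

def gamma_elements_per_second(virtual_users, fraction, window_s):
    """Elements (all shifts) per second when `fraction` of all (l,k) pairs
    must be recomputed within `window_s` seconds."""
    return fraction * virtual_users ** 2 / window_s


print(gamma_elements_per_second(200, 0.10, 0.002))  # 200 virtual users
print(gamma_elements_per_second(128, 0.10, 0.002))  # 128 virtual users
```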

[0285] In what has been presented, the elements are calculated 
for all 512 shifts. Not all of these shifts are needed, 
so it is possible to reduce the number of calculations per 
element. The cost is increased design complexity. 

[0286] Therefore, a possible loading scenario for performing 
short-code multiple user detection on the hardware 
described herein is illustrated in FIG. 11. A processor board 
118 with four compute elements 220 can be used as shown. 
Three of the compute nodes (e.g., 220a-220c) can be used to 
calculate the C-matrix and R-matrix. One of the compute 
nodes (e.g., 220d) can be used for multiple-stage decision-feedback 
interference cancellation (MDFIC) techniques. 
The Γ-matrix and R-matrix are calculated using FPGAs 
that can be programmed by the host controller 203, or 
ASICs. Further, multiuser amplitude estimation is performed 
within the modem card 112. 

Long-Code Processing 

[0287] Therefore it can be appreciated by one skilled in 
the art that short-code MUD can be performed using the 
system architecture described herein. FIG. 12 shows a 
preferred embodiment for long-code MUD processing. In 
this embodiment, each frame of data is processed three times 
by the MUD processor, although it can be recognized that 
multiple processors can perform the iterative nature of the 
embodiment. During the first pass, only the control channels 
are respread while the maximum ratio combination (MRC) 
and MUD processing are performed on the data channels. 
During subsequent passes, data channels are processed 
exclusively. New y (i.e., soft decisions) and b (i.e., hard 
decisions) data are derived as shown in the diagram. 

[0288] Amplitude ratios and amplitudes are determined 
via the DSP (e.g., element 900, or a DSP otherwise coupled 
with the processor board 118 and receiver 110), as well as 
certain waveform statistics. These values (e.g., matrices and 
vectors) are used by the MUD processor in various ways. 
The MUD processor is decomposed into four stages that 
closely match the structure of the software simulation: Alpha 
Calculation and Respread 1302, raised-cosine filtering 1304, 
de-spreading 1306, and MRC 1308. Each pass through the 
MUD processor is equivalent to one processing stage of the 
software implementation. The design is pipelined and 
"parallelized." In the illustrated embodiment, the clock speed 
can be 132 MHz, resulting in a throughput of 233 ms/frame; 
however, the clock rate and throughput vary depending on 
the requirements. The illustrated embodiment allows for 
three-pass MUD processing with additional overhead from 
external processing, resulting in a 4-times real-time 
processing throughput. 

[0289] The alpha calculation and respread operations 1302 
are carried out by a set of thirty-two processing elements 
arranged in parallel. These can be processing elements 
within an ASIC, FPGA, PLD or other such device, for 
example. Each processing element processes two users of 
four fingers each. Values for b are stored in a double-buffered 
lookup table. Values of a(hat) and ja(hat) are 
pre-multiplied with beta by an external processor and stored 
in a quad-buffered lookup table. The alpha calculation stage 
generates the following values for each finger, where subscripts 
indicate antenna identifier: 

$$\alpha_0 = \beta_0 \cdot (C \cdot \hat{a}_0 - jC \cdot j\hat{a}_0)$$

$$\alpha_1 = \beta_1 \cdot (C \cdot \hat{a}_1 - jC \cdot j\hat{a}_1)$$



[0290] These values are accumulated during the serial 
processing cycle into four independent 8-times oversam- 
pling buffers. There are eight memory elements in each 
buffer and the element used is determined by the sub-chip 
delay setting for each finger. 

[0291] Once eight fingers have been accumulated into the 
oversampling buffer, the data is passed into a set of four 
independent adder-trees. These adder-trees each terminate 
in a single output, completing the respread operation. 

[0292] The four raised-cosine filters 1304 convolve the 
alpha data with a set of weights determined by the following 
equation: 

44).c44) 



[0293] The filters can be implemented with 97 taps with 
odd symmetry. The filters illustrated run at 8-times the chip 
rate, however, other rates are possible. The filters can be 
implemented in a variety of compute elements 220, or other 
devices such as ASICs, FPGAs for example. 

[0294] The despread function 1306 can be performed by a 
set of thirty-two processing elements arranged in parallel. 
Each processing element serially processes two users of four 
fingers each. 

[0295] For each finger, one chip value out of eight, 
selected based on the sub-chip delay, is accepted from the 
output of the raised-cosine filter. The despread stage performs 
the following calculations for each finger (subscripts 
indicate antenna): 

SF-1 
0 

SF-1 

= Yj C / V o-/ C r o 

0 




SF-1 
0 

SF-l 
0 



[0296] The MRC operations are carried out by a set of four 
processing elements arranged in parallel, such as the compute 
elements 220 for example. Each processor is capable of 
serially processing eight users of four fingers each. Values 
for y are stored in a double-buffered lookup table. Values for 
b are derived from the MSB of the y data. Note that the b 
data used in the MUD stage is independent of the b data used 
in the respread stage. Values of a(hat) and ja(hat) are pre-multiplied 
with beta by an external processor and stored in a quad-buffered 
lookup table. Also, Σ(a(hat)² + ja(hat)²) for each channel is 
stored in a quad-buffered table. 

[0297] The output stage contains a set of sequential des- 
tination buffer pointers for each channel. The data generated 
by each channel, on a slot basis, is transferred to the 
RACEway™ destination indicated by these buffers. The first 
word of each of these transfers will contain a counter in the 
lower sixteen bits indicating how many y values were 
generated. The upper sixteen bits will contain the constant 
value 0xAA55. This will allow the DSP to avoid interrupts 
by scanning the first word of each buffer. 
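
The first-word convention described above (constant 0xAA55 in the upper sixteen bits, y-value count in the lower sixteen) can be sketched as follows. This is a hypothetical illustration of how a DSP might poll a buffer; the function names are not taken from the text.

```python
# Sketch of the first word of each output transfer (helper names are hypothetical).

MARKER = 0xAA55  # constant placed in the upper sixteen bits


def make_header(y_count):
    """Build the first word: marker in bits 31..16, y-value count in bits 15..0."""
    if not 0 <= y_count <= 0xFFFF:
        raise ValueError("count must fit in sixteen bits")
    return (MARKER << 16) | y_count


def parse_header(word):
    """Return the y-value count if the marker is present, else None.

    A polling DSP could treat None as 'buffer not yet written'.
    """
    if (word >> 16) & 0xFFFF != MARKER:
        return None
    return word & 0xFFFF


header = make_header(150)
print(hex(header), parse_header(header))
```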

[0298] In addition, the DSP_UPDATE register contains a 
pointer to a single RACEway™ location. Each time a slot of 
channel data is transmitted, an internal counter is written to 
this location. The counter is limited to 10 bits and will wrap 
around with a terminal count value of 1023. 

[0299] The method of operation for the long-code multiple 
user detection algorithm (LCMUD) is as follows. Spread 
factor four (SF4) channels require a significant amount of data 
transfer. In order to limit the gate count of the hardware 
implementation, processing an SF4 channel can result in 
reduced capability. 

[0300] An SF4 user can be processed on certain hardware 
channels. When one of these special channels is operating on 
an SF4 user, the next three channels are disabled and are 
therefore unavailable for processing. This relationship is as 
shown in the following table: 



SF4 Chan   Disabled Channels   SF4 Chan   Disabled Channels
0          1, 2, 3             32         33, 34, 35
4          5, 6, 7             36         37, 38, 39
8          9, 10, 11           40         41, 42, 43
12         13, 14, 15          44         45, 46, 47
16         17, 18, 19          48         49, 50, 51
20         21, 22, 23          52         53, 54, 55
24         25, 26, 27          56         57, 58, 59
28         29, 30, 31          60         61, 62, 63
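
The relationship in the table is regular: every fourth channel can host an SF4 user, and doing so disables the next three channels. A minimal Python sketch of that rule follows; the function name and the 64-channel constant are illustrative.

```python
# SF4 channel rule from the table (function name is hypothetical).

NUM_CHANNELS = 64


def disabled_channels(sf4_channel):
    """Channels made unavailable when `sf4_channel` processes an SF4 user."""
    if not 0 <= sf4_channel < NUM_CHANNELS or sf4_channel % 4 != 0:
        raise ValueError("SF4 users are only supported on channels 0, 4, ..., 60")
    return [sf4_channel + i for i in (1, 2, 3)]


print(disabled_channels(12))
```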



[0301] The default y and b data buffers do not contain 
enough space for SF4 data. When a channel is operating on 
SF4 data, the y and b buffers extend into the space of the next 
channel in sequence. For example, if channel 0 is processing 
SF4 data, the channel 0 and channel 1 b buffers are merged 
into a single large buffer of 0x40 32-bit words. The y buffers 
are merged similarly. 

[0302] In typical operation, the first pass of the LCMUD 
algorithm will respread the control channels in order to 
remove control interference. For this pass, the b data for the 
control channels should be loaded into BLUT while the y 
data for data channels should be loaded into YDEC. Each 
channel should be configured to operate at the spread factor 
of the data channel stored into the YDEC table. 

[0303] Control channels are always operated at SF 256, so 
it is likely that the control data will need to be replicated to 
match the data channel spread factor. For example, each bit 
(b entry) of control data would be replicated 64 times if that 
control channel were associated with an SF 4 data channel. 
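
The replication described above amounts to repeating each control bit 256/SF times. The following Python sketch illustrates this; the function name is hypothetical and the bits are represented as a plain list rather than in BLUT format.

```python
# Replicating SF 256 control bits to match a data channel's spread factor.
# Function name is hypothetical; bits are kept in a plain list for clarity.

CONTROL_SF = 256


def replicate_control_bits(bits, data_sf):
    """Repeat each control bit so one entry exists per data-channel bit period."""
    if data_sf <= 0 or CONTROL_SF % data_sf != 0:
        raise ValueError("data spread factor must divide 256")
    factor = CONTROL_SF // data_sf
    return [b for b in bits for _ in range(factor)]


expanded = replicate_control_bits([1, 0], data_sf=4)
print(len(expanded))  # two control bits, each repeated 64 times
```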

[0304] Each finger in a channel arrives at the receiver with 
a different delay. During the Respread operation, this skew 
among the fingers is recreated. During the MRC stage of 
MUD processing, it is necessary to remove this skew and 
realign the fingers of each channel. 

[0305] This is accomplished in the MUD processor by 
determining the first bit available from the most delayed 
finger and discarding all previous bits from all other fingers. 
The number of bits to discard can be individually pro- 
grammed for each finger with the Discard field of the 
MUDPARAM registers. 

[0306] This operation will typically result in a 'short' first 
slot of data. This is unavoidable when the MUD processor 
is first initialized and should not create any significant 
problems. The entire first slot of data can be completely 
discarded if 'short' slots are undesirable. 

[0307] A similar situation will arise each time processing 
is begun on a frame of data. To avoid losing data, it is 
recommended that a partial slot of data from the previous 
frame be overlapped with the new frame. Trimming any 
redundant bits created this way can be accomplished with 
the Discard register setting or in the system DSP. In order to 
limit memory requirements, the LCMUD FPGA processes 
one slot of data at a time. Double buffering is used for b 
and y data so that processing can continue as data is 
streamed in. Filling these buffers is complicated by the skew 
that exists among fingers in a channel. 



[0308] FIG. 13 illustrates the skew relationship among 
fingers in a channel and among the channels themselves. The 
illustrated embodiment allows for 20 µs (77.8 chips) of skew 
among fingers in a channel and certain skew among channels; 
however, in other embodiments these skew allowances 
vary. 

[0309] There are three related problems that are intro- 
duced by skew: Identifying frame & slot boundaries, popu- 
lating b and y tables and changing channel constants. 

[0310] Because every finger of every channel can arrive at 
a different time, there are no universal frame and slot 
boundaries. The DSP must select an arbitrary reference 
point. The data stored in b & y tables is likely to come from 
two adjacent slots. 

[0311] Because skew exists among fingers in a channel, it 
is not enough to populate the b & y tables with 2,560 
sequential chips of data. There must be some data overlap 
between buffers to allow lagging channels to access "old" 
data. The amount of overlap can be calculated dynamically 
or fixed at some number greater than 78 and divisible by four 
(e.g., 80 chips). The starting point for each register is 
determined by the Chip Advance field of the MUDPARAM 
register. 
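
One way to pick a fixed overlap that satisfies the rule above (strictly greater than the skew bound and divisible by four) is sketched below. The function name is hypothetical, and the choice of the smallest qualifying value is an assumption; the text only requires that the value be large enough.

```python
# Choosing a buffer overlap in chips (illustrative; function name is hypothetical).
import math


def overlap_chips(skew_bound_chips, multiple=4):
    """Smallest multiple of `multiple` strictly greater than the skew bound."""
    n = math.floor(skew_bound_chips) + 1      # smallest integer above the bound
    return -(-n // multiple) * multiple       # round up to the next multiple


print(overlap_chips(78))
```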

[0312] A related problem is created by the significant skew 
among channels. As can be seen in FIG. 13, Channel 0 is 
receiving Slot 0 while Channel 1 is receiving Slot 2. The 
DSP must take this skew into account when generating the 
b and y tables and temporally align channel data. 

[0313] Selecting an arbitrary "slot" of data from a channel 
implies that channel constants tied to the physical slot 
boundaries may change while processing the arbitrary slot. 
The Constant Advance field of the MUDPARAM register is 
used to indicate when these constants should change. 

[0314] Registers affected this way are quad-buffered. 
Before data processing begins, at least two of these buffers 
should be initialized. During normal operation, one addi- 
tional buffer is initialized for each slot processed. This 
system guarantees that valid constants data will always be 
available. 

[0315] The following two tables show the long-code 
MUD FPGA memory map and control/status register: 



Start Addr   End Addr    Name         Description
0000_0000    0000_0000   CSR          Control & Status Register
0000_0008    0000_000C   DSP_UPDATE   Route & Address for DSP updating
0001_0000    0001_FFFF   MUDPARAM     MUD Parameters
0002_0000    0002_FFFF   CODE         Spreading Codes
0003_0000    0004_FFFF   BLUT         Respread: b Lookup Table
0005_0000    0005_FFFF   BETA_A       Respread: Beta * a_hat Lookup Table
0006_0000    0007_FFFF   YDEC         MUD & MRC: y Lookup Table
0008_0000    0008_FFFF   ASQ          MUD & MRC: Sum a_hat squared LUT
000A_0000    000A_FFFF   OUTPUT       Output Routes & Addresses

[0316]

Bits    Name       R/W   Reset
31-9    Reserved   RO    X
8       YB         RO    0
7-6     CBUF       RO    0
5       A1         RO    0
4       A0         RO    0
3       R1         RW    0
2       R0         RW    0
1       Lst        RW    0
0       Rst        RW    0


[0317] The register YB indicates which of two y and b 
buffers are in use. If the system is currently not processing, 
YB indicates the buffer that will be used when processing is 
initiated. 

[0318] CBUF indicates which of four round-robin buffers 
for MUD constants (a_hat * beta) is currently in use. Finger skew 
will result in some fingers using a buffer one in advance of 
this indicator. To guarantee that valid data is always available, 
two full buffers should be initialized before operation 
begins. 

[0319] If the system is currently not processing, CBUF 
indicates the buffer that will be used when processing is 
restarted. It is technically possible to indicate precisely 
which buffer is in use for each finger in both the Respread 
and Despread processing stages. However, this would 
require thirty-two 32-bit registers. Implementing these registers 
would be costly, and the information is of little value. 

[0320] A1 and A0 indicate which y and b buffers are 
currently being processed. A1 and A0 will never indicate '1' 
at the same time. An indication of '0' for both A1 and A0 
means that the MUD processor is idle. 

[0321] R1 and R0 are writable fields that indicate to the 
MUD processor that data is available. R1 corresponds to y 
and b buffer 1 and R0 corresponds to y and b buffer 0. 
Writing a '1' into the correct register will initiate MUD 
processing. Note that these buffers follow strict round-robin 
ordering. The YB register indicates which buffer should be 
activated next. 

[0322] These registers will be automatically reset to '0' by 
the MUD hardware once processing is completed. It is not 
possible for the external processor to force a '0' into these 
registers. 

[0323] A '1' in the Lst bit indicates that this is the last slot of 
data in a frame. Once all available data for the slot has been 
processed, the output buffers will be flushed. 

[0324] A '1' in the Rst bit will place the MUD processor into 
a reset state. The external processor must manually bring the 
MUD processor out of reset by writing a '0' into this bit. 
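
The control/status fields described above can be decoded from a 32-bit CSR word as sketched below. This is a minimal sketch under stated assumptions: the bit positions (YB at bit 8, CBUF at bits 7-6, A1, A0, R1, R0, Lst, Rst at bits 5 down to 0) are inferred from the register layout and are not guaranteed, and all function and class names are hypothetical.

```python
# Decoding the control/status register (bit positions are assumptions
# inferred from the register layout; all names are hypothetical).
from typing import NamedTuple

YB_BIT, CBUF_SHIFT, A1_BIT, A0_BIT = 8, 6, 5, 4
R1_BIT, R0_BIT, LST_BIT, RST_BIT = 3, 2, 1, 0


class CsrFields(NamedTuple):
    yb: int    # which y/b buffer is (or will be) in use
    cbuf: int  # which of four constants buffers is in use
    a1: int    # buffer 1 currently being processed
    a0: int    # buffer 0 currently being processed
    r1: int    # data ready in buffer 1
    r0: int    # data ready in buffer 0
    lst: int   # last slot of the frame
    rst: int   # reset


def decode_csr(word):
    """Split a 32-bit CSR word into its fields."""
    def bit(n):
        return (word >> n) & 1
    return CsrFields(bit(YB_BIT), (word >> CBUF_SHIFT) & 0x3, bit(A1_BIT),
                     bit(A0_BIT), bit(R1_BIT), bit(R0_BIT), bit(LST_BIT),
                     bit(RST_BIT))


def start_command(buffer_index, last_slot=False):
    """Word an external processor might write to start processing a buffer."""
    if buffer_index not in (0, 1):
        raise ValueError("there are only two y/b buffers")
    word = 1 << (R1_BIT if buffer_index else R0_BIT)
    if last_slot:
        word |= 1 << LST_BIT
    return word


fields = decode_csr(0x190)
idle = fields.a0 == 0 and fields.a1 == 0   # both zero means the processor is idle
print(fields, idle)
```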

[0325] DSP_UPDATE is arranged as two 32-bit registers. 
A RACEway™ route to the MUD DSP is stored at address 
0x0000_0008. A pointer to a status memory buffer is 
located at address 0x0000_000C. 

[0326] Each time the MUD processor writes a slot of 
channel data to a completion buffer, an incrementing count 
value is written to this address. The counter is fixed at 10 bits 
and will wrap around after a terminal count of 1023. 

[0327] A quad-buffered version of the MUD parameter 
control register exists for each finger to be processed. 
Execution begins with buffer 0 and continues in round-robin 
fashion. These buffers are used in synchronization with the 
MUD constants (Beta * a_hat, etc.) buffers. Each finger is 
provided with an independent register to allow independent 
switching of constant values at slot and frame boundaries. 
The following table shows offsets for each MUD channel: 



Offset   User   Offset   User   Offset   User   Offset   User
0x0000   0      0x0040   1      0x0080   2      0x00C0   3
0x0100   4      0x0140   5      0x0180   6      0x01C0   7
0x0200   8      0x0240   9      0x0280   10     0x02C0   11
0x0300   12     0x0340   13     0x0380   14     0x03C0   15
0x0400   16     0x0440   17     0x0480   18     0x04C0   19
0x0500   20     0x0540   21     0x0580   22     0x05C0   23
0x0600   24     0x0640   25     0x0680   26     0x06C0   27
0x0700   28     0x0740   29     0x0780   30     0x07C0   31
0x0800   32     0x0840   33     0x0880   34     0x08C0   35
0x0900   36     0x0940   37     0x0980   38     0x09C0   39
0x0A00   40     0x0A40   41     0x0A80   42     0x0AC0   43
0x0B00   44     0x0B40   45     0x0B80   46     0x0BC0   47
0x0C00   48     0x0C40   49     0x0C80   50     0x0CC0   51
0x0D00   52     0x0D40   53     0x0D80   54     0x0DC0   55
0x0E00   56     0x0E40   57     0x0E80   58     0x0EC0   59
0x0F00   60     0x0F40   61     0x0F80   62     0x0FC0   63


[0328] The following table shows buffer offsets within 
each channel: 






Offset   Finger   Buffer
0x0000   0        0
0x0004   0        1
0x0008   0        2
0x000C   0        3
0x0010   1        0
0x0014   1        1
0x0018   1        2
0x001C   1        3
0x0020   2        0
0x0024   2        1
0x0028   2        2
0x002C   2        3
0x0030   3        0
0x0034   3        1
0x0038   3        2
0x003C   3        3
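
Taken together, the channel and finger/buffer offsets give a simple address formula for each MUD parameter register. The Python sketch below is illustrative: it assumes the offsets are relative to the MUDPARAM base address 0x0001_0000 from the memory map, and the function name is hypothetical.

```python
# Address of a MUDPARAM control register (illustrative sketch).
# Assumption: offsets are relative to the MUDPARAM base in the memory map.

MUDPARAM_BASE = 0x0001_0000
USER_STRIDE = 0x40     # one block per MUD channel (user)
FINGER_STRIDE = 0x10   # four fingers per channel
BUFFER_STRIDE = 0x04   # four 32-bit buffers per finger


def mudparam_address(user, finger, buffer):
    """Byte address of the control register for (user, finger, buffer)."""
    if not (0 <= user < 64 and 0 <= finger < 4 and 0 <= buffer < 4):
        raise ValueError("index out of range")
    return (MUDPARAM_BASE + user * USER_STRIDE
            + finger * FINGER_STRIDE + buffer * BUFFER_STRIDE)


print(hex(mudparam_address(1, 2, 3)))
```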



[0329] The following table shows details of the control 
register: 

Bits    Fields (most significant first)          R/W   Reset
31-16   Spread Factor, Subchip Delay, Discard    RW    X
15-0    Chip Advance, Constant Advance           RW    X

[0330] The spread factor field determines how many chip 
samples are used to generate a data bit. In the illustrated 
embodiment, all fingers in a channel have the same spread 
factor setting; however, it can be appreciated by one skilled 
in the art that the setting can vary among fingers in 
other embodiments. The spread factor is encoded into a 3-bit 
value as shown in the following table: 

SF Factor   Spread Factor
000         256
001         128
010         64
011         32
100         16
101         8
110         4
111         RESERVED

[0331] The Subchip Delay field specifies the sub-chip delay for the finger. 
It is used to select one of eight accumulation buffers prior to 
summing all Alpha values and passing them into the raised-cosine 
filter. 

[0332] Discard determines how many MUD-processed 
soft decisions (y values) to discard at the start of processing. 
This is done so that the first y value from each finger 
corresponds to the same bit. After the first slot of data is 
processed, the Discard field should be set to zero. 

[0333] The behavior of the Discard field is different from 
that of other register fields. Once a non-zero discard setting 
is detected, any new discard settings from switching to a 
new table entry are ignored until the current discard count 
reaches zero. After the count reaches zero, a new discard 
setting may be loaded the next time a new table entry is 
accessed. 

[0334] All fingers within a channel will arrive at the 
receiver with different delays. Chip Advance is used to 
recreate this signal skew during the Respread operation. Y 
and b buffers are arranged with older data occupying lower 
memory addresses. Therefore, the finger with the earliest 
arrival time has the highest value of Chip Advance. Chip 
Advance need not be a multiple of Spread Factor. 

[0335] Constant Advance indicates on which chip this 
finger should switch to a new set of constants (e.g., a_hat) and 
a new control register setting. Note that the new values take 
effect on the chip after the value stored here. For example, 
a value of 0x0 would cause the new constants to take effect 
on chip 1. A value of 0xFF would cause the new constants 
to take effect on chip 0 of the next slot. The b lookup tables 
are arranged as shown in the following table. B values each 
occupy two bits of memory, although only the LSB is 
utilized by LCMUD hardware. 



Offset   Buffer    Offset   Buffer    Offset   Buffer    Offset   Buffer
0x0000   U0 B0     0x0020   U1 B0     0x0040   U0 B1     0x0060   U1 B1
0x0080   U2 B0     0x00A0   U3 B0     0x00C0   U2 B1     0x00E0   U3 B1
0x0100   U4 B0     0x0120   U5 B0     0x0140   U4 B1     0x0160   U5 B1
0x0180   U6 B0     0x01A0   U7 B0     0x01C0   U6 B1     0x01E0   U7 B1
0x0200   U8 B0     0x0220   U9 B0     0x0240   U8 B1     0x0260   U9 B1
0x0280   U10 B0    0x02A0   U11 B0    0x02C0   U10 B1    0x02E0   U11 B1
0x0300   U12 B0    0x0320   U13 B0    0x0340   U12 B1    0x0360   U13 B1
0x0380   U14 B0    0x03A0   U15 B0    0x03C0   U14 B1    0x03E0   U15 B1
0x0400   U16 B0    0x0420   U17 B0    0x0440   U16 B1    0x0460   U17 B1
0x0480   U18 B0    0x04A0   U19 B0    0x04C0   U18 B1    0x04E0   U19 B1
0x0500   U20 B0    0x0520   U21 B0    0x0540   U20 B1    0x0560   U21 B1
0x0580   U22 B0    0x05A0   U23 B0    0x05C0   U22 B1    0x05E0   U23 B1
0x0600   U24 B0    0x0620   U25 B0    0x0640   U24 B1    0x0660   U25 B1
0x0680   U26 B0    0x06A0   U27 B0    0x06C0   U26 B1    0x06E0   U27 B1
0x0700   U28 B0    0x0720   U29 B0    0x0740   U28 B1    0x0760   U29 B1
0x0780   U30 B0    0x07A0   U31 B0    0x07C0   U30 B1    0x07E0   U31 B1
0x0800   U32 B0    0x0820   U33 B0    0x0840   U32 B1    0x0860   U33 B1
0x0880   U34 B0    0x08A0   U35 B0    0x08C0   U34 B1    0x08E0   U35 B1
0x0900   U36 B0    0x0920   U37 B0    0x0940   U36 B1    0x0960   U37 B1
0x0980   U38 B0    0x09A0   U39 B0    0x09C0   U38 B1    0x09E0   U39 B1
0x0A00   U40 B0    0x0A20   U41 B0    0x0A40   U40 B1    0x0A60   U41 B1
0x0A80   U42 B0    0x0AA0   U43 B0    0x0AC0   U42 B1    0x0AE0   U43 B1
0x0B00   U44 B0    0x0B20   U45 B0    0x0B40   U44 B1    0x0B60   U45 B1
0x0B80   U46 B0    0x0BA0   U47 B0    0x0BC0   U46 B1    0x0BE0   U47 B1
0x0C00   U48 B0    0x0C20   U49 B0    0x0C40   U48 B1    0x0C60   U49 B1
0x0C80   U50 B0    0x0CA0   U51 B0    0x0CC0   U50 B1    0x0CE0   U51 B1
0x0D00   U52 B0    0x0D20   U53 B0    0x0D40   U52 B1    0x0D60   U53 B1
0x0D80   U54 B0    0x0DA0   U55 B0    0x0DC0   U54 B1    0x0DE0   U55 B1
0x0E00   U56 B0    0x0E20   U57 B0    0x0E40   U56 B1    0x0E60   U57 B1
0x0E80   U58 B0    0x0EA0   U59 B0    0x0EC0   U58 B1    0x0EE0   U59 B1
0x0F00   U60 B0    0x0F20   U61 B0    0x0F40   U60 B1    0x0F60   U61 B1
0x0F80   U62 B0    0x0FA0   U63 B0    0x0FC0   U62 B1    0x0FE0   U63 B1
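
The b lookup table interleaves the two buffers of each even/odd pair of users, so the offset for a given user and buffer follows a compact rule. The Python sketch below expresses that rule; the function name is hypothetical and the formula is inferred from the listed offsets.

```python
# Offset of a b lookup buffer within BLUT (rule inferred from the table;
# function name is hypothetical).

def blut_offset(user, buffer):
    """Byte offset of buffer `buffer` (0 or 1) for user `user` (0..63)."""
    if not (0 <= user < 64 and buffer in (0, 1)):
        raise ValueError("index out of range")
    pair = user // 2           # users are grouped in even/odd pairs
    return pair * 0x80 + buffer * 0x40 + (user % 2) * 0x20


print(hex(blut_offset(33, 1)))
```

Because the two B0 buffers of a pair are adjacent, merging them for SF4 processing on the even channel gives one contiguous region.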



[0336] The following table illustrates how the two-bit 
values are packed into 32-bit words. Spread Factor 4 channels 
require more storage space than is available in a single 
channel buffer. To allow for SF4 processing, the buffers for 
an even channel and the next highest odd channel are joined 
together. The even channel performs the processing while 
the odd channel is disabled. 

Bits   31-30   29-28   27-26   25-24   23-22   21-20   19-18   17-16
Name   b(0)    b(1)    b(2)    b(3)    b(4)    b(5)    b(6)    b(7)

Bits   15-14   13-12   11-10   9-8     7-6     5-4     3-2     1-0
Name   b(8)    b(9)    b(10)   b(11)   b(12)   b(13)   b(14)   b(15)
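
The packing shown above places b(0) in the two most significant bits and b(15) in the two least significant bits of each 32-bit word. A minimal Python sketch of packing and unpacking follows; the function names are hypothetical.

```python
# Packing sixteen two-bit b values into a 32-bit word, b(0) in bits 31-30.
# Function names are hypothetical.

VALUES_PER_WORD = 16


def pack_b(values):
    """Pack exactly sixteen two-bit values into one 32-bit word."""
    if len(values) != VALUES_PER_WORD:
        raise ValueError("expected sixteen values")
    word = 0
    for i, b in enumerate(values):
        word |= (b & 0x3) << (30 - 2 * i)
    return word


def unpack_b(word):
    """Recover the sixteen two-bit values from a 32-bit word."""
    return [(word >> (30 - 2 * i)) & 0x3 for i in range(VALUES_PER_WORD)]


bits = [1, 0, 1, 1] + [0] * 12
word = pack_b(bits)
print(hex(word), unpack_b(word) == bits)
```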



[0337] The beta*a-hat table contains the amplitude esti- 
mates for each finger pre-multiplied by the value of Beta. 
The following table shows the memory mappings for each 
channel. 



Offset   User   Offset   User   Offset   User   Offset   User
0x0000   0      0x0080   1      0x0100   2      0x0180   3
0x0200   4      0x0280   5      0x0300   6      0x0380   7
0x0400   8      0x0480   9      0x0500   10     0x0580   11
0x0600   12     0x0680   13     0x0700   14     0x0780   15
0x0800   16     0x0880   17     0x0900   18     0x0980   19
0x0A00   20     0x0A80   21     0x0B00   22     0x0B80   23
0x0C00   24     0x0C80   25     0x0D00   26     0x0D80   27
0x0E00   28     0x0E80   29     0x0F00   30     0x0F80   31
0x1000   32     0x1080   33     0x1100   34     0x1180   35
0x1200   36     0x1280   37     0x1300   38     0x1380   39
0x1400   40     0x1480   41     0x1500   42     0x1580   43
0x1600   44     0x1680   45     0x1700   46     0x1780   47
0x1800   48     0x1880   49     0x1900   50     0x1980   51
0x1A00   52     0x1A80   53     0x1B00   54     0x1B80   55
0x1C00   56     0x1C80   57     0x1D00   58     0x1D80   59
0x1E00   60     0x1E80   61     0x1F00   62     0x1F80   63


[0338] The following table shows buffers that are distributed 
for each channel: 

Offset   User Buffer
0x00     0
0x20     1
0x40     2
0x60     3

[0339] The following table shows a memory mapping for 
individual fingers of each antenna. 

Offset   Finger   Antenna
0x00     0        0
0x04     1        0
0x08     2        0
0x0C     3        0
0x10     0        1
0x14     1        1
0x18     2        1
0x1C     3        1
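
The three levels of offsets for the beta*a-hat table (user, buffer, then antenna and finger) can be combined into a single address calculation. The Python sketch below is a hedged illustration: it assumes the offsets are relative to the BETA_A base address 0x0005_0000 from the memory map and that the four buffers of a channel are spaced 0x20 apart; the function name is hypothetical.

```python
# Address of a beta * a_hat entry (illustrative sketch).
# Assumptions: offsets are relative to the BETA_A base in the memory map,
# and the four per-channel buffers are spaced 0x20 apart.

BETA_A_BASE = 0x0005_0000
USER_STRIDE = 0x80
BUFFER_STRIDE = 0x20
ANTENNA_STRIDE = 0x10
FINGER_STRIDE = 0x04


def beta_a_address(user, buffer, antenna, finger):
    """Byte address of the entry for (user, buffer, antenna, finger)."""
    if not (0 <= user < 64 and 0 <= buffer < 4
            and antenna in (0, 1) and 0 <= finger < 4):
        raise ValueError("index out of range")
    return (BETA_A_BASE + user * USER_STRIDE + buffer * BUFFER_STRIDE
            + antenna * ANTENNA_STRIDE + finger * FINGER_STRIDE)


print(hex(beta_a_address(2, 1, 1, 3)))
```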




[0340] The y (soft decisions) table contains two buffers for 
each channel. Like the b lookup table, an even and odd 
channel are bonded together to process SF4. Each y data 
value is stored as a byte. The data is written into the buffers 
as packed 32-bit words. 

Offset   Buffer    Offset   Buffer    Offset   Buffer    Offset   Buffer
0x0000   U0 B0     0x0200   U1 B0     0x0400   U2 B1     0x0600   U3 B1
0x0800   U0 B0     0x0A00   U1 B0     0x0C00   U2 B1     0x0E00   U3 B1
0x1000   U4 B0     0x1200   U5 B0     0x1400   U6 B1     0x1600   U7 B1
0x1800   U4 B0     0x1A00   U5 B0     0x1C00   U6 B1     0x1E00   U7 B1
0x2000   U8 B0     0x2200   U9 B0     0x2400   U10 B1    0x2600   U11 B1
0x2800   U8 B0     0x2A00   U9 B0     0x2C00   U10 B1    0x2E00   U11 B1
0x3000   U12 B0    0x3200   U13 B0    0x3400   U14 B1    0x3600   U15 B1
0x3800   U12 B0    0x3A00   U13 B0    0x3C00   U14 B1    0x3E00   U15 B1
0x4000   U16 B0    0x4200   U17 B0    0x4400   U18 B1    0x4600   U19 B1
0x4800   U16 B0    0x4A00   U17 B0    0x4C00   U18 B1    0x4E00   U19 B1
0x5000   U20 B0    0x5200   U21 B0    0x5400   U22 B1    0x5600   U23 B1
0x5800   U20 B0    0x5A00   U21 B0    0x5C00   U22 B1    0x5E00   U23 B1
0x6000   U24 B0    0x6200   U25 B0    0x6400   U26 B1    0x6600   U27 B1
0x6800   U24 B0    0x6A00   U25 B0    0x6C00   U26 B1    0x6E00   U27 B1
0x7000   U28 B0    0x7200   U29 B0    0x7400   U30 B1    0x7600   U31 B1
0x7800   U28 B0    0x7A00   U29 B0    0x7C00   U30 B1    0x7E00   U31 B1
0x8000   U32 B0    0x8200   U33 B0    0x8400   U34 B1    0x8600   U35 B1
0x8800   U32 B0    0x8A00   U33 B0    0x8C00   U34 B1    0x8E00   U35 B1
0x9000   U36 B0    0x9200   U37 B0    0x9400   U38 B1    0x9600   U39 B1
0x9800   U36 B0    0x9A00   U37 B0    0x9C00   U38 B1    0x9E00   U39 B1
0xA000   U40 B0    0xA200   U41 B0    0xA400   U42 B1    0xA600   U43 B1
0xA800   U40 B0    0xAA00   U41 B0    0xAC00   U42 B1    0xAE00   U43 B1
0xB000   U44 B0    0xB200   U45 B0    0xB400   U46 B1    0xB600   U47 B1
0xB800   U44 B0    0xBA00   U45 B0    0xBC00   U46 B1    0xBE00   U47 B1
0xC000   U48 B0    0xC200   U49 B0    0xC400   U50 B1    0xC600   U51 B1
0xC800   U48 B0    0xCA00   U49 B0    0xCC00   U50 B1    0xCE00   U51 B1
0xD000   U52 B0    0xD200   U53 B0    0xD400   U54 B1    0xD600   U55 B1
0xD800   U52 B0    0xDA00   U53 B0    0xDC00   U54 B1    0xDE00   U55 B1
0xE000   U56 B0    0xE200   U57 B0    0xE400   U58 B1    0xE600   U59 B1
0xE800   U56 B0    0xEA00   U57 B0    0xEC00   U58 B1    0xEE00   U59 B1
0xF000   U60 B0    0xF200   U61 B0    0xF400   U62 B1    0xF600   U63 B1
0xF800   U60 B0    0xFA00   U61 B0    0xFC00   U62 B1    0xFE00   U63 B1


[0341] The sum of the a-hat squares is stored as a 16-bit 
value. The following table contains a memory address 
mapping for each channel. 

Offset   User   Offset   User   Offset   User   Offset   User
0x0000   0      0x0020   1      0x0040   2      0x0060   3
0x0080   4      0x00A0   5      0x00C0   6      0x00E0   7
0x0100   8      0x0120   9      0x0140   10     0x0160   11
0x0180   12     0x01A0   13     0x01C0   14     0x01E0   15
0x0200   16     0x0220   17     0x0240   18     0x0260   19
0x0280   20     0x02A0   21     0x02C0   22     0x02E0   23
0x0300   24     0x0320   25     0x0340   26     0x0360   27
0x0380   28     0x03A0   29     0x03C0   30     0x03E0   31
0x0400   32     0x0420   33     0x0440   34     0x0460   35
0x0480   36     0x04A0   37     0x04C0   38     0x04E0   39
0x0500   40     0x0520   41     0x0540   42     0x0560   43
0x0580   44     0x05A0   45     0x05C0   46     0x05E0   47
0x0600   48     0x0620   49     0x0640   50     0x0660   51
0x0680   52     0x06A0   53     0x06C0   54     0x06E0   55
0x0700   56     0x0720   57     0x0740   58     0x0760   59
0x0780   60     0x07A0   61     0x07C0   62     0x07E0   63


[0342] Within each buffer, the value for antenna 0 is stored 
at address offset 0x0 with the value for antenna 1 stored 
at address offset 0x04. The following table demonstrates a 
mapping for each finger. 

Offset   User Buffer
0x00     0
0x08     1
0x10     2
0x18     3



[0343] Each channel is provided a RACEway™ route on 
the bus, and a base address for buffering output on a slot 
basis. Registers for controlling buffers are allocated as 



shown in the following two tables. External devices are 
blocked from writing to register addresses marked as 
reserved. 



Offset    User
0x0000    0
0x0020    1
0x0040    2
0x0060    3
0x0080    4
0x00A0    5
0x00C0    6
0x00E0    7
0x0100    8
0x0120    9
0x0140    10
0x0160    11
0x0180    12
0x01A0    13
0x01C0    14
0x01E0    15
0x0200    16
0x0220    17
0x0240    18
0x0260    19
0x0280    20
0x02A0    21
0x02C0    22
0x02E0    23
0x0300    24
0x0320    25
0x0340    26
0x0360    27
0x0380    28
0x03A0    29
0x03C0    30
0x03E0    31
0x0400    32
0x0420    33
0x0440    34
0x0460    35
0x0480    36
0x04A0    37
0x04C0    38
0x04E0    39
0x0500    40
0x0520    41
0x0540    42
0x0560    43
0x0580    44
0x05A0    45
0x05C0    46
0x05E0    47
0x0600    48
0x0620    49
0x0640    50
0x0660    51
0x0680    52
0x06A0    53
0x06C0    54
0x06E0    55
0x0700    56
0x0720    57
0x0740    58
0x0760    59
0x0780    60
0x07A0    61
0x07C0    62
0x07E0    63



[0344] Slot buffer size is automatically determined by the
channel spread factor. Buffers are used in round-robin
fashion and all buffers for a channel must be arranged
contiguously. The buffers control register determines how many
buffers are allocated for each channel. A setting of 0
indicates one available buffer, a setting of 1 indicates two
available buffers, and so on.
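
The buffer rules in the preceding paragraph can be illustrated with a minimal sketch. The function names, the `base` address and the `slot_buffer_size` argument are hypothetical; the text states only that the buffer size follows from the spread factor, that buffers are contiguous and used round-robin, and that a register setting of n means n + 1 buffers.

```python
def buffer_count(register_setting: int) -> int:
    """A setting of 0 means one buffer, 1 means two buffers, and so on."""
    if register_setting < 0:
        raise ValueError("register setting must be non-negative")
    return register_setting + 1


def next_buffer(current: int, register_setting: int) -> int:
    """Advance round-robin through the buffers of one channel."""
    return (current + 1) % buffer_count(register_setting)


def buffer_address(base: int, index: int, slot_buffer_size: int) -> int:
    """Buffers are contiguous, so buffer i starts i slot-sizes past the base.

    slot_buffer_size is passed in by the caller (assumption): the text says
    only that it is derived from the channel spread factor.
    """
    return base + index * slot_buffer_size
```

With a register setting of 2 (three buffers), the index sequence produced by `next_buffer` is 0, 1, 2, 0, and so on.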

Methods for Estimating Symbols Embodied in 
Short-Code User Waveforms 

[0345] As discussed above, systems according to the 
invention perform multi-user detection by determining cor- 
relations among the user channel-corrupted waveforms and 
storing these correlations as elements of the R-matrices. The 
correlations are updated in real time to track continually 
changing channel characteristics. The changes can stem 
from changes in user code correlations, which depend on the 
relative lag among various user multi-path components, as 
well as from the much faster variations of the Rayleigh- 
fading multi-path amplitudes. The relative lags among 
multi-path components can change with a time constant, for 
example, of about 400 ms whereas the multi-path ampli- 
tudes can vary temporally with a time constant of, for 
example, 1.33 ms. The R-matrices are used to cancel the 
multiple access interference through the Multi-stage Deci- 
sion-Feedback Interference Cancellation (MDFIC) tech- 
nique. 

[0346] In the preceding discussion and those that follow, 
the term physical user refers to a CDMA signal source, e.g., 
a user cellular phone, modem or other CDMA signal source, 
the transmitted waveforms from which are processed by a 
base station and, more particularly, by MUD processing card 
118. In the illustrated embodiment, each physical user is
considered to be composed of one or more virtual users
and, more typically, a plurality of virtual users.

[0347] A virtual user is deemed to "transmit" a single bit 
per symbol period, where a symbol period can be, for 
example, a time duration of 256 chips (1/15 ms). Thus, the 
number of virtual users, for a given physical user, is equal to 
the number of bits transmitted in a symbol period. In the 
illustrated embodiment, each physical user is associated 
with at least two virtual users, one of which corresponds to 
a Dedicated Physical Control Channel (DPCCH) and the 
other of which corresponds to a Dedicated Physical Data 
Channel (DPDCH). Other embodiments may provide for a
single virtual user per physical user, as well as, of course,
three or more virtual users per physical user.

[0348] In the illustrated embodiment, when a Spreading
Factor (SF) associated with a physical user is less than 256,
J = 256/SF data bits and one control bit are transmitted per
symbol period. Hence, for the rth physical user with
data-channel spreading factor SF_r, there are a total of
1 + 256/SF_r virtual users. The total number of virtual users,
for K physical users, can then be denoted by:

$$K_v = \sum_{r=1}^{K} \left( 1 + \frac{256}{SF_r} \right) \qquad (1)$$




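[0349] The waveform transmitted by the rth physical user
can be written as:

$$x_r[t] = \sum_{k=1}^{1+256/SF_r} \beta_k \sum_{m} s_k[t - mT]\, b_k[m], \qquad s_k[t] \equiv \sum_{p=0}^{N-1} h[t - pN_c]\, c_k[p] \qquad (2)$$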



[0350] where t is the integer time sample index, T = N N_c
represents the data bit duration, N = 256 represents the
short-code length, N_c is the number of samples per chip, and where
β_k = β_c if the kth virtual user is a control channel and
β_k = β_d if the kth virtual user is a data channel. The
multipliers β_c and β_d are utilized to select the relative
amplitudes of the control and data channels. In the illustrated
embodiment, at least one of the above constants equals 1 for any
given symbol period, m.
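
As a worked illustration of the virtual-user bookkeeping described above, the following sketch counts virtual users from a list of per-user spreading factors and computes the symbol period in samples. The function names are hypothetical, and N_c = 4 is taken from the exemplary values given later in this description.

```python
def virtual_users_per_physical(sf: int) -> int:
    """One control channel plus 256/SF data bits per symbol period."""
    if sf <= 0 or 256 % sf != 0:
        raise ValueError("spreading factor must divide 256")
    return 1 + 256 // sf


def total_virtual_users(spreading_factors) -> int:
    """Sum of (1 + 256/SF_r) over all physical users."""
    return sum(virtual_users_per_physical(sf) for sf in spreading_factors)


def symbol_period_samples(n_chips: int = 256, samples_per_chip: int = 4) -> int:
    """T = N * N_c, the data bit duration in samples."""
    return n_chips * samples_per_chip
```

For 100 voice users at SF = 256 this gives 200 virtual users, and a single user at SF = 4 contributes 65.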

[0351] The waveform sk[t], which is herein referred to as 
the transmitted signature waveform for the kth virtual user, 
is generated by the illustrated system by passing the spread 
code sequence ck[n] through a root-raised-cosine pulse 
shaping filter h[t]. When the kth virtual user corresponds to 
a data user with a spreading factor that is less than 256, the 
code ck[n] retains a length of 256, but only Nk of these 256 
elements are non-zero, where Nk is the spreading factor for 
the kth virtual user. The non-zero values are extracted from
the code C_{ch,256,64}[n]·S_sh[n].

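[0352] The baseband received signal can be written as:

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT]\, b_k[m] + w[t], \qquad \tilde{s}_k[t] \equiv \sum_{q=1}^{L} a_{kq}\, s_k[t - \tau_{kq}] \qquad (3)$$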



[0353] where w[t] is receiver noise, s̃_k[t] is the
channel-corrupted signature waveform for virtual user k, L is the
number of multipath components, and a_kq are the complex
multipath amplitudes. The amplitude ratios β_k are
incorporated into the amplitudes a_kq. If k and l are two virtual
users that correspond to the same physical user then, aside from
scaling factors β_k and β_l, a_kq and a_lq are equal. This is due
to the fact that the signal waveforms of all virtual users
corresponding to the same physical user pass through the
same channel. Further, the waveform s_k[t] represents the
received signature waveform for the kth virtual user, and it
differs from the transmitted signature waveform given in
Equation (2) in that the root-raised-cosine pulse h[t] is
replaced with the raised-cosine pulse g[t].

[0354] The received signal that has been match-filtered to
the chip pulse is also match-filtered in the illustrated
embodiment to the user code sequence in order to obtain a
detection statistic, herein referred to as y_k, for the kth virtual
user. Because there are K_v codes, there are K_v such
detection statistics. For each virtual user, the detection statistics
can be collected into a column vector y[m] whose mth entry
corresponds to the mth symbol period. More particularly, the
matched filter output y_l[m] for the lth virtual user can be
written as:

$$y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n} r[nN_c + \hat{\tau}_{lq} + mT]\, c_l^{*}[n] \right\} \qquad (4)$$



[0355] where â_lq^* is an estimate of a_lq^*, τ̂_lq is an estimate
of τ_lq, N_l is the (non-zero) length of code c_l[n], and η_l[m]
represents the match-filtered receiver noise. Substituting the
expression for r[t] from Equation (3) in Equation (4) results
in the following equation:

$$\begin{aligned}
y_l[m] &= \sum_{m'} \sum_{k=1}^{K_v} r_{lk}[m']\, b_k[m - m'] + \eta_l[m], \\
r_{lk}[m'] &\equiv \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*}\, a_{kq'} \cdot \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \tau_{kq'}]\, c_l^{*}[n] \right\}
\end{aligned} \qquad (5)$$



[0356] The terms for m' ≠ 0 result from asynchronous users.

Calculation of the R-matrix

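[0357] Determination of the R-matrix elements defined by
Equation (5) above can be divided into two or more separate
calculations, each having an associated time constant or
period of execution corresponding to a time constant or
period during which a corresponding characteristic of the
user waveforms is expected to change in real time. In the
illustrated embodiment, three sets of calculations are
employed as reflected in the following equations:

$$\begin{aligned}
r_{lk}[m'] &= \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{lq}^{*}\, a_{kq'}\, C_{lkqq'}[m'] \right\} \\
C_{lkqq'}[m'] &\equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}]\, c_l^{*}[n] \\
&= \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \tau_{lq} - \tau_{kq'}]\, c_k[p]\, c_l^{*}[n] \\
&= \sum_{m} g[mN_c + m'T + \tau_{lq} - \tau_{kq'}]\, \Gamma_{lk}[m] \\
\Gamma_{lk}[m] &\equiv \frac{1}{2N_l} \sum_{n} c_l^{*}[n]\, c_k[n - m]
\end{aligned} \qquad (6)$$

[0358] where the hats (^), indicating parameter estimates,
have been omitted.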

[0359] With reference to Equation (6), the Γ-matrix,
whose elements vary with the slowest time constant,
represents the user code correlations for all values of offset m. For
the case of 100 voice users, the total memory requirement
for storing the Γ-matrix elements is 21 Mbytes based on two
bytes (e.g., the real and imaginary parts) per element. In the
illustrated embodiment, the Γ-matrix is updated only
when new codes associated with new users are added to the
system. Hence, the Γ-matrix is effectively a quasi-static
matrix, and thus, its computational requirements are minimal.

[0360] The selection of the most efficient method for
calculating the Γ-matrix elements depends on the non-zero
length of the codes. For example, the non-zero length of the
codes in the case of high data-rate users can be only 4 chips.
In such a case, a direct convolution, e.g., convolution
in the time domain, can be the most efficient method of
calculating the elements of the Γ-matrix. For low data-rate
users, it may be more efficient to calculate the elements of
the Γ-matrix by utilizing Fast Fourier Transforms (FFTs) to
perform convolutions in the frequency domain.

[0361] In one method according to the teachings of the
invention, the C-matrix elements are calculated by utilizing
the Γ-matrix elements. The C-matrix elements need to be
calculated upon occurrence of a change in a user's delay lag
(e.g., time-lag). For example, consider a case in which each
multi-path component changes on average every 400 ms,
and the length of the g[] function is 48 samples. In such a
case, assuming an over-sampling by four, forty-eight
operations per element need to be performed (for example,
12 multiply-accumulates, real × complex, for each
element). Further, if 100 low-rate users (i.e., 200 virtual users)
are utilizing the system, and assuming that a single multipath lag
out of four changes for one user, a total of (1.5)(2)K_v N_v
elements need to be calculated. The factor of 1.5 arises from
the three C-matrices (e.g., m' = −1, 0, 1), which is reduced by a
factor of two as a result of a conjugate symmetry condition.
Moreover, the factor of two arises from the fact that both
rows and columns need to be updated. The factor N_v
represents the number of virtual users per physical user,
which for the lowest rate users is N_v = 2 as stated above. In
total, this amounts to approximately 230,400 operations per
multipath component per physical user. Accordingly, it gives
rise to 230 MOPS based on 100 physical users with four
multipath components per user, each changing once per 400
msec. Of course, in other embodiments these values can
differ.

[0362] The C-matrices are then utilized to calculate the 
R-matrices. More particularly, the elements of the R-matrix 
can be obtained as follows by utilizing Equation (6) above: 



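$$r_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{lq}^{*} \cdot C_{lkqq'}[m'] \cdot a_{kq'} \right\} = \mathrm{Re}\left\{ a_l^{H} \cdot C_{lk}[m'] \cdot a_k \right\} \qquad (7)$$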



[0363] where a_k are L×1 vectors, and C_lk[m'] are L×L
matrices. The rate at which the above calculations need to be
performed depends on the velocity of the users. For
example, in one embodiment, the update rate is selected to
be 1.33 msec. An update rate that is too slow, such that the
estimated values of the R-matrix deviate significantly from
the actual R-matrix values, results in a degradation of the
MUD efficiency. For example, FIG. 14 presents a graph that
depicts the change in the MUD efficiency versus user
velocity for an update rate of 1.33 msec, which corresponds
to two WCDMA time slots. This graph indicates that the
MUD efficiency is high for users having velocities that are
less than about 100 km/h. The graph further shows that the
interference corresponding to fast users is not canceled as
effectively as the interference corresponding to slow users.
Thus, for a system that is utilized by a mix of fast and slow
users, the total MUD efficiency is an average of the MUD
efficiency for the range of user velocities. Utilizing the
above Equation (7), the R-matrix elements can be calculated
in terms of an X matrix that represents amplitude-amplitude
multiplies as shown below:

$$\begin{aligned}
r_{lk}[m'] &= \mathrm{Re}\left\{ a_l^{H} \cdot C_{lk}[m'] \cdot a_k \right\} = \mathrm{Re}\left\{ \mathrm{tr}\left[ C_{lk}[m'] \cdot a_k a_l^{H} \right] \right\} = \mathrm{tr}\left[ C_{lk}^{R}[m']\, X_{lk}^{R} \right] - \mathrm{tr}\left[ C_{lk}^{I}[m']\, X_{lk}^{I} \right] \\
X_{lk} &\equiv a_k a_l^{H} = X_{lk}^{R} + j X_{lk}^{I}, \qquad C_{lk}[m'] \equiv C_{lk}^{R}[m'] + j\, C_{lk}^{I}[m']
\end{aligned} \qquad (8)$$



[0364] The use of the X-matrix as illustrated above
advantageously allows reusing the X-matrix multiplies for all
virtual users associated with a physical user and for all m'
(i.e., m' = 0, 1). The remaining calculations can be expressed
as a single real dot product of length 2L^2 = 32. The
calculations can be performed, for example, in 16-bit fixed point
math. Then, the total operations can amount to
1.5(4)(K_v L)^2 = 3.84 million operations per update, resulting,
at the 1.33 msec update rate, in a processing requirement
of 2.90 GOPS. The X-matrix multiplies, when amortized,
amount to an additional 0.7 GOPS. Thus, the total
processing requirement can be 3.60 GOPS.
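
The following sketch illustrates the reuse idea described above: the quantity Re{a_l^H C_lk a_k} is evaluated once directly and once as a single real dot product of length 2L^2 between the entries of C_lk and a precomputed X = a_k a_l^H. The function names and data layout are hypothetical and are not taken from the illustrated embodiment; the last function reproduces the operation count quoted in the text.

```python
def r_direct(a_l, a_k, c):
    """Re{ a_l^H C a_k } evaluated as a double sum over q and q'."""
    total = 0j
    for q in range(len(a_l)):
        for qp in range(len(a_k)):
            total += a_l[q].conjugate() * c[q][qp] * a_k[qp]
    return total.real


def x_matrix(a_l, a_k):
    """X[q'][q] = a_k[q'] * conj(a_l[q]); reusable for every m'."""
    return [[ak * al.conjugate() for al in a_l] for ak in a_k]


def r_via_x(c, x):
    """Single real dot product of length 2*L*L: tr[C^R X^R] - tr[C^I X^I]."""
    c_vec, x_vec = [], []
    for q in range(len(c)):
        for qp in range(len(c)):
            c_vec += [c[q][qp].real, c[q][qp].imag]
            x_vec += [x[qp][q].real, -x[qp][q].imag]
    return sum(cv * xv for cv, xv in zip(c_vec, x_vec)), len(c_vec)


def r_matrix_ops(k_v: int, n_paths: int) -> float:
    """Operation count 1.5 * 4 * (K_v * L)^2 quoted in the text."""
    return 1.5 * 4 * (k_v * n_paths) ** 2
```

For K_v = 200 and L = 4 the count is 3.84 million operations, and dividing by the 1.33 msec update period gives roughly 2.9 GOPS.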

[0365] The matched-filter outputs can be obtained from 
the above Equation (5) as follows: 



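$$y_l[m] = r_{ll}[0]\, b_l[m] + \sum_{k=1}^{K_v} r_{lk}[-1]\, b_k[m+1] + \sum_{k=1}^{K_v} \left[ r_{lk}[0] - r_{ll}[0]\, \delta_{lk} \right] b_k[m] + \sum_{k=1}^{K_v} r_{lk}[1]\, b_k[m-1] + \eta_l[m] \qquad (9)$$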



[0366] wherein the first term represents a signal of interest,
and the remaining terms represent Multiple Access
Interference (MAI) and noise. The illustrated embodiment uses a
Multi-stage Decision-Feedback Interference Cancellation
(MDFIC) algorithm to solve for the symbol
estimates in accord with the following relationship:

$$\hat{b}_l[m] = \mathrm{sign}\left\{ y_l[m] - \sum_{k=1}^{K_v} r_{lk}[-1]\, \hat{b}_k[m+1] - \sum_{k=1}^{K_v} \left[ r_{lk}[0] - r_{ll}[0]\, \delta_{lk} \right] \hat{b}_k[m] - \sum_{k=1}^{K_v} r_{lk}[1]\, \hat{b}_k[m-1] \right\} \qquad (10)$$

[0367] with initial estimates given by hard decisions on
the matched-filter detection statistics,

$$\hat{b}_l[m] = \mathrm{sign}\{ y_l[m] \}.$$

[0368] A further appreciation of these and alternate
MDFIC techniques may be attained by reference to an
article by T. R. Giallorenzi and S. G. Wilson, titled
"Decision feedback multi-user receivers for asynchronous
CDMA systems", published in IEEE Global
Telecommunications Conference, pages 1677-1682 (June 1993),
and herein incorporated by reference. Related techniques,
such as Successive Interference Cancellation (SIC) and Parallel
Interference Cancellation (PIC), can be used in addition or
instead.

[0369] In the illustrated embodiment, the new estimates 
b̂_l[m] are immediately introduced back into the interference
cancellation as they are calculated. Hence at any given 
cancellation step, the best available symbol estimates are 
used. In one embodiment, the above iteration can be per- 
formed on a block of 20 symbols, which represents two 
WCDMA time slots. The R-matrices are assumed to be 
constant over this period. The sign detector in Equation (10) 
above can be replaced by a hyperbolic tangent detector to 
improve performance under high input BER. A hyperbolic 
tangent detector has a single slope parameter which varies 
from one iteration to another. 
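
The decision-feedback iteration of Equation (10), with new estimates fed back as soon as they are available, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the illustrated embodiment: the R-matrices are passed as a dictionary keyed by symbol offset (−1, 0, 1), symbols outside the processed block are treated as contributing nothing, and a plain sign detector is used.

```python
def sign(x: float) -> int:
    return 1 if x >= 0 else -1


def mdfic(y, r, iterations: int = 2):
    """y[m][l]: matched-filter outputs; r[-1], r[0], r[1]: K_v x K_v lists."""
    n_sym, n_users = len(y), len(y[0])
    # Initial estimates: hard decisions on the detection statistics.
    b = [[sign(v) for v in row] for row in y]
    for _ in range(iterations):
        for m in range(n_sym):
            for l in range(n_users):
                stat = y[m][l]
                for k in range(n_users):
                    if m + 1 < n_sym:
                        stat -= r[-1][l][k] * b[m + 1][k]
                    if k != l:  # the r_ll[0] term is the signal of interest
                        stat -= r[0][l][k] * b[m][k]
                    if m >= 1:
                        stat -= r[1][l][k] * b[m - 1][k]
                b[m][l] = sign(stat)  # fed back immediately
    return b
```

In a two-user synchronous example with a strong second user, the initial hard decision for the weak user can be wrong while the iteration above recovers it, because the strong user's estimate is subtracted before the weak user's sign is taken.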

[0370] The three R-matrices (R[−1], R[0] and R[1]) are
each K_v × K_v in size. Hence, the total number of operations per
iteration is 6K_v^2. The computational complexity of the
MDFIC algorithm depends on the total number of virtual
users, which in turn depends on the mix of users at various
spreading factors. For K_v = 200 virtual users (e.g., 100 low-rate
users), the computation requires 240,000 operations. In one
embodiment, two iterations are employed, which require a
total of 480,000 operations. For real-time applications, these
operations must be performed in 1/15 ms or less. Thus, the
total processing requirement is 7.2 GOPS. Computational
complexity is markedly reduced if a threshold parameter is
set such that IC is performed only for those |y_l[m]| below the
threshold. If |y_l[m]| is large, there is little doubt as to the sign
of b_l[m], and IC need not be performed. The value of the
threshold parameter can be variable from stage to stage.
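
The arithmetic in the preceding paragraph can be checked with a short sketch. The function names are hypothetical. The factor 6 is read here as three K_v × K_v matrices at one multiply and one add per element, which is an interpretation rather than something the text spells out.

```python
def mdfic_ops_per_iteration(k_v: int) -> int:
    """6 * K_v^2 operations per iteration, as stated in the text."""
    return 6 * k_v ** 2


def required_gops(k_v: int, iterations: int, symbol_period_s: float) -> float:
    """Operations per symbol period converted to giga-operations per second."""
    return mdfic_ops_per_iteration(k_v) * iterations / symbol_period_s / 1e9


def needs_cancellation(y_value: float, threshold: float) -> bool:
    """Interference cancellation is skipped when |y| is at or above the threshold."""
    return abs(y_value) < threshold
```

With K_v = 200, two iterations and a symbol period of 1/15 ms, `required_gops` evaluates to about 7.2.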

C-Matrix Calculation 

[0371] As discussed above, the C-matrix elements are 
utilized to calculate the R-matrices, which in turn are 
employed by an MDF Interference Cancellation routine. The 
C-matrix elements can be calculated by utilizing different 
techniques, as described elsewhere herein. In one approach, 
the C-matrix elements are calculated directly whereas in 
another approach the C-matrix elements are computed from 
the r-matrix elements, as discussed in detail below and 
illustrated elsewhere herein. 

[0372] More particularly, in one method for calculating 
the C matrix elements, each C-matrix element can be 
calculated as a dot product between the kth user's waveform 
and the lth user's code stream, each offset by some multipath
delay. For this method of calculation, each time a user's 
multipath profile changes, all the C-matrix elements asso- 
ciated with the changed profile need to be recalculated. A 
user's profile can change very rapidly, for example, every 
100 msec or faster, thereby necessitating frequent updates of 
the C-matrix elements. Such frequent updates of the C-ma- 
trix elements can give rise to a large amount of overhead 
associated with computations that need to be performed 
before obtaining each dot product. In fact, obtaining the 
C-matrix elements by the above approach may require 
dedicating an entire processor for performing the requisite 
calculations. 

[0373] Another approach according to the teachings of the
invention for calculating the C-matrix elements
pre-calculates the code correlations up-front when a user is added to
the system. The calculations are performed over all possible
code offsets and can be stored, for example, in a large array
(e.g., approximately 21 Mbytes in size), herein referred to as
the Γ-matrix. This allows updating C-matrix elements when
a user's profile changes by extracting the appropriate
elements from the Γ-matrix and performing minor
calculations. Since the Γ-matrix elements are calculated for all
code offsets, FFT can be effectively employed to speed up
the calculations. Further, because all code offsets are
pre-calculated, rapidly changing multipath profiles can be
readily accommodated. This approach has a further
advantage in that it minimizes the use of resources that need to be
allocated for extracting the C-matrix elements when the
number of users accessing the system is constant.

C-matrix Elements Expressed in Terms of Code 
Correlations 

[0374] As discussed above, the R-matrix elements can be 
given in terms of the C-matrix elements as follows: 



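$$r_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{lq}^{*}\, a_{kq'}\, C_{lkqq'}[m'] \right\}, \qquad C_{lkqq'}[m'] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}]\, c_l^{*}[n] \qquad (11)$$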



[0375] where C_lkqq'[m'] is a five-dimensional matrix of
code correlations. Both l and k range from 1 to K_v, where K_v
is the number of virtual users. The indices q and q' range
from 1 to L, representing the number of multipath
components, which in this exemplary embodiment is assumed to be
4. The symbol period offset m' ranges from −1 to 1. The total
number of matrix elements to be calculated is then
N_C = 3(K_v L)^2 = 3(800)^2 = 1.92M complex elements, requiring
3.84 MB of storage if each real and imaginary part is stored
as one byte. The following symmetry property of the C-matrix
elements can be utilized to halve the storage requirement, for
example, in this case to 1.92 MB:

$$C_{lkqq'}[-m'] = \frac{N_k}{N_l}\, C_{klq'q}^{*}[m'] \qquad (12)$$



[0376] It is evident from the above Equation (12) that each
element of C_lkqq'[m'] is formed as a complex dot product
between a code vector c_l and a waveform vector s_k. In this
exemplary embodiment, the length of the code vector is 256.
The waveform s_k[t], herein referred to as the signature
waveform for the kth virtual user, is generated by applying
a pulse-shaping filter g[t] to the spread code sequence c_k[n]
as follows:

$$s_k[t] = \sum_{p=0}^{N-1} g[t - pN_c]\, c_k[p] \qquad (13)$$



[0377] where N = 256 and g[t] is the raised-cosine pulse
shape. Since g[t] is a raised-cosine pulse as opposed to a
root-raised-cosine pulse, the signature waveform s_k[t]
includes the effects of filtering by the matched chip filter. For
spreading factors less than 256, some of the chips c_k[n] are
zero. The length of the waveform vector s_k[t] is L_g + 255N_c,
where L_g is the length of the raised-cosine pulse vector g[t]
and N_c is the number of samples per chip. The values for
these parameters in this exemplary embodiment are selected
to be L_g = 48 and N_c = 4. The length of the waveform vector
is then 1068, but for performing the dot product, it is
accessed at a stride of N_c = 4, which results in an effective
length of 267.
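
The construction of the signature waveform and the quoted vector lengths can be illustrated with the sketch below. The function name is hypothetical, and the pulse passed in the usage example is a flat placeholder rather than a true raised-cosine pulse, since only the lengths are being checked. Output index 0 corresponds to t = −L_g/2 + 1.

```python
def signature_waveform(code, pulse, samples_per_chip: int):
    """s_k[t] = sum over p of g[t - p*N_c] * c_k[p].

    `pulse` holds g[t] for t = -L_g/2 + 1 .. L_g/2, so its length is L_g.
    """
    l_g = len(pulse)
    out = [0j] * (l_g + (len(code) - 1) * samples_per_chip)
    for p, chip in enumerate(code):
        if chip == 0:  # chips are zero for spreading factors below 256
            continue
        for i, g in enumerate(pulse):
            out[p * samples_per_chip + i] += g * chip
    return out


# Length check with the exemplary parameters L_g = 48 and N_c = 4.
waveform = signature_waveform([1 + 1j] * 256, [1.0] * 48, 4)
strided = waveform[::4]
```

Here `len(waveform)` is 48 + 255·4 = 1068 and `len(strided)` is 267, matching the values in the text.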

[0378] In this exemplary embodiment, the raised-cosine
pulse vector g[t] is defined to be non-zero from
t = −L_g/2 + 1 : L_g/2, with g[0] = 1. With this definition the
waveform s_k[t] is non-zero in a range from
t = −L_g/2 + 1 : L_g/2 + 255N_c.

[0379] By combining Equations (11) and (13), the
calculation of the C-matrix elements can be expressed directly in
terms of the user code correlations. These correlations can
be calculated up front and stored, for example, in SDRAM.
The C-matrix elements expressed in terms of the code
correlations Γ_lk[m] are:

$$C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \tau_{lq} - \tau_{kq'}]\, c_k[p]\, c_l^{*}[n] = \sum_{m} g[mN_c + m'T + \tau_{lq} - \tau_{kq'}]\, \Gamma_{lk}[m] \qquad (14)$$
[0380] Since the pulse shape vector g[n] is of length L_g, at
most 2L_g/N_c = 24 real macs need to be performed to calculate
each element C_lkqq'[m'] (the factor of 2 arises because the
code correlations Γ_lk[m] are complex). For a given τ, the
method of the invention efficiently calculates the range of m
values for which g[mN_c + τ] is non-zero as described below.
The minimum value of m is given by m_min1 N_c + τ = −L_g/2 + 1,
and τ is given by τ ≡ m'NN_c + τ_lq − τ_kq'. If each τ value is
decomposed as τ_lq = n_lq N_c + p_lq, then
m_min1 = ceil[(−τ − L_g/2 + 1)/N_c]
= −m'N − n_lq + n_kq' − L_g/(2N_c) + ceil[(p_kq' − p_lq + 1)/N_c],
where ceil[(p_kq' − p_lq + 1)/N_c] will be either 0 or 1. It is
convenient to set this value to 0. In order to avoid accessing
values outside the allocation for g[n], g[n] = 0.0 for
n = −L_g/2 : −L_g/2 − (N_c − 1). All but one of the N_c^2 possible
values for ceil[(p_kq' − p_lq + 1)/N_c] are 0.

[0381] Accordingly, the following relation holds:

$$m_{min1} = -m'N - n_{lq} + n_{kq'} - \frac{L_g}{2N_c} \qquad (15)$$

[0382] wherein L_g is divisible by 2N_c and L_g/(2N_c) is a
system constant.

[0383] Since the maximum value of m is given by
m_max1 N_c + τ = L_g/2, the following holds:

m_max1 = floor[(−τ + L_g/2)/N_c] = −m'N − n_lq + n_kq' + L_g/(2N_c) + floor[(p_kq' − p_lq)/N_c]

[0384] Further, floor[(p_kq' − p_lq)/N_c] can be either −1 or 0.
In this exemplary embodiment, it is convenient to set this
value to 0. In order to avoid accessing values outside the
allocation for g[n], g[n] is set to 0.0 (g[n] = 0.0) for
n = L_g/2 + 1 : L_g/2 + N_c. It is noted that half of the N_c^2
possible values for floor[(p_kq' − p_lq)/N_c] are 0. Accordingly,
the following relation holds:

$$m_{max1} = -m'N - n_{lq} + n_{kq'} + \frac{L_g}{2N_c} \qquad (16)$$

[0385] The values of m_min1 and m_max1 are quickly
calculable.
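
The exact ceil and floor expressions for the range of m can be sketched with integer arithmetic as below. The function name is hypothetical; the pulse is assumed, as in the text, to be non-zero only for arguments from −L_g/2 + 1 to L_g/2. The simplified system-constant forms of Equations (15) and (16) are not used here.

```python
def lag_range(tau: int, l_g: int, samples_per_chip: int):
    """Range of m for which g[m*N_c + tau] can be non-zero.

    m_min = ceil((-tau - L_g/2 + 1) / N_c), m_max = floor((-tau + L_g/2) / N_c).
    """
    half = l_g // 2
    # ceil(a / b) for integers is -((-a) // b)
    m_min = -((tau + half - 1) // samples_per_chip)
    m_max = (half - tau) // samples_per_chip
    return m_min, m_max
```

For L_g = 48 and N_c = 4 the range always spans 12 values of m, which is consistent with the bound of 2L_g/N_c = 24 real multiply-accumulates for complex code correlations.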

[0386] The calculation of the C-matrix elements typically
requires a small subset of the Γ-matrix elements. The Γ-matrix
elements can be calculated for all values of m by
utilizing Fast Fourier Transform (FFT) as described in detail
below.

Using FFT to Calculate the Γ-matrix Elements

[0387] It was shown above that the Γ-matrix elements can
be represented as a convolution. Accordingly, the FFT
convolution theorem can be exploited to calculate the
Γ-matrix elements. From the above Equation (14), the Γ-matrix
elements are defined as follows:

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n=0}^{N-1} c_l^{*}[n]\, c_k[n - m] \qquad (17)$$



[0388] where N = 256. Three streams are related by this
equation. In order to apply the convolution theorem, these
three streams are defined over the same time interval. The
code streams c_k[n] and c_l[n] are non-zero from n = 0:255.
These intervals are based on the maximum spreading factor.
For higher data-rate users, the intervals over which the
streams are non-zero are reduced further. The intervals
derived from the highest spreading factor are of particular
interest in defining a common interval for all streams
because they represent the largest intervals. The common
interval allows the FFTs to be reused for all user interactions.

[0389] With reference to FIG. 15, the range of values of
m for which Γ_lk[m] is non-zero can be derived from the
above intervals. The maximum value of m is limited by
n − m ≥ 0, which gives

$$255 - m_{max} = 0 \;\Rightarrow\; m_{max} = 255 \qquad (18)$$

[0390] and the minimum value of m is limited by
n − m ≤ 255, which gives

$$0 - m_{min} = 255 \;\Rightarrow\; m_{min} = -255 \qquad (19)$$

[0391] To achieve a common interval for all three streams, 
an interval defined by m=-M/2: M/2-1, M=512 is selected. 
The streams are zero-padded to fill up the interval, if needed. 

[0392] Accordingly, the DFT and IDFT of the streams are 
given by the following relations: 



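$$C_l[r] = \sum_{n=-M/2}^{M/2-1} c_l[n]\, e^{-j 2\pi n r / M}, \qquad c_l[n] = \frac{1}{M} \sum_{r=-M/2}^{M/2-1} C_l[r]\, e^{j 2\pi n r / M} \qquad (20)$$

[0393] which gives

$$\Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n} c_k[n - m]\, c_l^{*}[n] = \frac{1}{2N_l} \cdot \frac{1}{M} \sum_{r=-M/2}^{M/2-1} C_k[r]\, C_l^{*}[r]\, e^{-j 2\pi m r / M} \qquad (21)$$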
[0394] Hence, Γ_lk[m] can be calculated for all values of m
by utilizing FFT. Based on the analysis presented above,
many of these values will be zero for high data rate users. In
this exemplary embodiment, only the non-zero values are
stored in order to conserve storage space. The values of m
for which Γ_lk[m] is non-zero can be determined analytically,
as described in more detail below and illustrated elsewhere
herein.
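
The frequency-domain calculation can be illustrated with a small self-contained sketch that compares the direct definition of the code correlation with a transform-based evaluation. A naive O(M^2) discrete Fourier transform stands in for an FFT, indices run from 0 to M − 1 with negative lags taken modulo M instead of over the centered interval used in the text, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive forward DFT; a real system would use an FFT."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / size)
                for n in range(size)) for r in range(size)]


def gamma_direct(c_l, c_k, n_l, m):
    """(1 / (2 N_l)) * sum over n of conj(c_l[n]) * c_k[n - m]."""
    total = 0j
    for n in range(len(c_l)):
        if 0 <= n - m < len(c_k):
            total += c_l[n].conjugate() * c_k[n - m]
    return total / (2 * n_l)


def gamma_via_dft(c_l, c_k, n_l):
    """All lags at once: zero-pad to M = 2N, multiply spectra, transform."""
    n = len(c_l)
    size = 2 * n
    spec_l = dft(list(c_l) + [0] * (size - n))
    spec_k = dft(list(c_k) + [0] * (size - n))
    product = [sk * sl.conjugate() for sk, sl in zip(spec_k, spec_l)]
    corr = dft(product)
    return {m: corr[m % size] / (size * 2 * n_l) for m in range(-(n - 1), n)}
```

Zero-padding to twice the code length mirrors the choice of M = 512 for N = 256, so that lags between −(N − 1) and N − 1 do not alias.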

Storage and Retrieval of Γ-matrix Elements

[0395] As discussed above, the values of the Γ-matrix
elements which are non-zero need to be determined for
efficient storage of the Γ-matrix. For high data rate users,
certain elements c_l[n] are zero, even within the interval
n = 0:N−1, N = 256. These zero values reduce the interval over
which Γ_lk[m] is non-zero. In order to determine the interval
for non-zero values consider the following relations:

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n=0}^{N-1} c_l^{*}[n]\, c_k[n - m] \qquad (22)$$



[0396] The index j_l for the lth virtual user is defined such
that c_l[n] is non-zero only over the interval
n = j_l N_l : j_l N_l + N_l − 1. Correspondingly, the vector c_k[n]
is non-zero only over the interval n = j_k N_k : j_k N_k + N_k − 1.
Given these definitions, Γ_lk[m] can be rewritten as

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n=0}^{N_l-1} c_l^{*}[n + j_l N_l]\, c_k[n + j_l N_l - m] \qquad (23)$$

[0397] The minimum value of m for which Γ_lk[m] is
non-zero is

$$m_{min2} = j_l N_l - j_k N_k - N_k + 1 \qquad (24)$$

[0398] and the maximum value of m for which Γ_lk[m] is
non-zero is

$$m_{max2} = N_l - 1 + j_l N_l - j_k N_k \qquad (25)$$

[0399] The total number of non-zero elements is then

$$m_{total} \equiv m_{max2} - m_{min2} + 1 = N_l + N_k - 1 \qquad (26)$$
[0400] The table below provides the number of bytes per
(l, k) virtual-user pair based on 2 bytes per element: one byte
for the real part and one byte for the imaginary part.

              N_k = 256    128     64     32     16      8      4
N_l = 256        1022      766    638    574    542    526    518
      128         766      510    382    318    286    270    262
       64         638      382    254    190    158    142    134
       32         574      318    190    126     94     78     70
       16         542      286    158     94     62     46     38
        8         526      270    142     78     46     30     22
        4         518      262    134     70     38     22     14
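
The lag limits and the byte counts in the table follow directly from the code supports, as the short sketch below illustrates. The function names are hypothetical. The byte count assumes, as stated in the text, one byte each for the real and imaginary parts.

```python
def nonzero_lag_range(n_l: int, j_l: int, n_k: int, j_k: int):
    """Lags m for which the code correlation can be non-zero.

    c_l is non-zero on j_l*N_l .. j_l*N_l + N_l - 1 and c_k on
    j_k*N_k .. j_k*N_k + N_k - 1.
    """
    m_min2 = j_l * n_l - j_k * n_k - n_k + 1
    m_max2 = n_l - 1 + j_l * n_l - j_k * n_k
    return m_min2, m_max2


def bytes_per_pair(n_l: int, n_k: int) -> int:
    """Two bytes per complex element times N_l + N_k - 1 non-zero lags."""
    return 2 * (n_l + n_k - 1)
```

For two full-length codes (N_l = N_k = 256, j_l = j_k = 0) the range is −255 to 255, giving 511 lags and 1022 bytes, which is the top-left table entry.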



[0401] The memory requirements for storing the Γ matrix
for a given number of users at each spreading factor can be
determined as described below. For example, for K_q virtual
users at spreading factor N_q = 2^(8−q), q = 0:6, where K_q is the
qth element of the vector K (some elements of K may be zero),
the storage requirement can be computed as follows. Let
Table 1 above be stored in matrix M with elements M_qq'. For
example, M_00 = 1022, and M_01 = 766. The total memory
required by the Γ matrix in bytes is then given by the
following relation

$$M_{bytes} = \sum_{q=0}^{6} \left( \frac{K_q (K_q + 1)}{2}\, M_{qq} + \sum_{q'=q+1}^{6} K_q K_{q'} M_{qq'} \right) \qquad (27)$$



[0402] For example, for 200 virtual users at spreading
factor N_0 = 256, K_q = 200 δ_q0, which in turn results in
M_bytes = ½K_0(K_0 + 1)M_00 = 100(201)(1022) = 20.5 MB. For 10
384 Kbps users, K_q = K_0 δ_q0 + K_6 δ_q6 with K_0 = 10 and
K_6 = 640, which results in a storage requirement that is given
by the following relation:

5(11)(1022) + 10(640)(518) + 320(641)(14) = 6.2 MB.
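
The two storage figures above can be reproduced with the following sketch, which rebuilds the byte table from 2(N_l + N_k − 1) and applies the memory relation. The function and variable names are hypothetical.

```python
SPREADING_FACTORS = [256, 128, 64, 32, 16, 8, 4]  # N_q = 2**(8 - q), q = 0..6


def table_entry(q: int, qp: int) -> int:
    """M_qq': bytes stored per virtual-user pair at spreading factors N_q, N_q'."""
    return 2 * (SPREADING_FACTORS[q] + SPREADING_FACTORS[qp] - 1)


def gamma_memory_bytes(users_per_sf) -> int:
    """Total bytes for K_q virtual users at each spreading factor N_q."""
    total = 0
    for q, k_q in enumerate(users_per_sf):
        # Pairs within the same spreading factor, including self-pairs.
        total += k_q * (k_q + 1) // 2 * table_entry(q, q)
        # Pairs across different spreading factors.
        for qp in range(q + 1, len(users_per_sf)):
            total += k_q * users_per_sf[qp] * table_entry(q, qp)
    return total
```

`gamma_memory_bytes([200, 0, 0, 0, 0, 0, 0])` gives 20,542,200 bytes (about 20.5 MB), and `gamma_memory_bytes([10, 0, 0, 0, 0, 0, 640])` gives 6,243,090 bytes (about 6.2 MB).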

[0403] The Γ-matrix data can be addressed, stored, and
accessed as described below. In particular, for each pair (l,k),
k>=l, there is one complex Γ_lk[m] value for each value of m,
where m ranges from m_min2 to m_max2, and the total number
of non-zero elements is m_total = m_max2 − m_min2 + 1. Hence, for
each pair (l,k), k>=l, there exist 2m_total time-contiguous
bytes.

[0404] In one embodiment, an array structure is created to 
access the data, as shown below: 



struct {
    int m_min2;
    int m_max2;
    int m_total;
    char *Glk;
} G_info[N_VU_MAX][N_VU_MAX];



[0405] The C-matrix data can then be retrieved by
utilizing the following exemplary algorithm:

m_min2 = G_info[l][k].m_min2
m_max2 = G_info[l][k].m_max2
for m' = 0:1
    N1 = -m'*N - L_g/(2*N_c)
    for q = 0:L-1
        for q' = 0:L-1
            tau = m'*N*N_c + tau_lq - tau_kq'
            m_min1 = N1 - n_lq + n_kq'            (Equation (15))
            m_max1 = m_min1 + L_g/N_c             (Equation (16))
            m_min = max[m_min1, m_min2]
            m_max = min[m_max1, m_max2]
            if m_max >= m_min
                m_span = m_max - m_min + 1
                sum1 = 0.0;
                ptr1 = &G_info[l][k].Glk[m_min - m_min2]
                ptr2 = &g[m_min*N_c + tau]
                while m_span > 0
                    sum1 += (*ptr1++) * (*ptr2)
                    ptr2 += N_c       (consecutive m are N_c samples apart in g)
                    m_span--
                end
                C[m'][l][k][q][q'] = sum1
            end
        end
    end
end



[0406] Another method for calculating the Γ-matrix
elements, herein referred to as the direct method, performs a
direct convolution, for example, by employing the SAL
zconvx function, to compute these elements. This direct
method is preferable when the vector lengths are small. As
an illustration of the time required for performing
calculations, the table below provides exemplary timing data based
on a 400 MHz PPC7400 with 16 MHz, 2 MB L2 cache,
wherein the data is assumed to be resident in L1 cache. The
performance loss for L2 cache resident data is not severe.

          N_l    Timing (µs)    GFLOPS
1024      4      19.33          1.70
1024      8      29.73          2.20
1024      16     50.55          2.59
1024      32     92.32          2.84
1024      64     176.53         2.97
1024      128    346.80         3.47


[0407] As discussed above, FFT can also be utilized for
calculating the Γ-matrix elements. The time required to
perform a 512 complex FFT, with in-place calculation, on a
400 MHz PPC7400 with 16 MHz, 2 MB L2 cache is 10.94
µs for L1 resident data. Prior to performing the final FFT, a
complex vector multiplication of length 512 needs to be
performed. Exemplary timings for this computation are
provided in the following table:

Length    Location    Timing (µs)    GFLOPS
1024      L1          4.46           1.38
1024      L2          24.27          0.253
1024      DRAM        61.49          0.100



[0408] Further, exemplary timing data for moving data 
between memory and the processor is provided in the 
following table: 



Length    Location    Timing (µs)
1024      L1          1.20
1024      L2          15.34
1024      DRAM        30.05



[0409] FIG. 16 illustrates the Γ-matrix elements that need
to be calculated when a new physical user is added to the
system. Addition of a new physical user to the system results
in adding 1+J virtual users to the system: that is, 1 control
channel + J = 256/SF data channels. The number K_v represents
the number of initial virtual users. Hence there are (K_v+1)
elements added to the Γ-matrix as a result of the increase in the
number of the control channels, and J(K_v+1) + J(J+1)/2
elements added as a result of the increase in the number of the
data channels. The total number of elements added is then
(J+1)[K_v+1+J/2]. If FFT is utilized to perform the
calculations, the total number of FFTs to be performed is
(J+1) + (J+1)[K_v+1+J/2]. The first term represents the FFTs to
transform c_k[n], and the second term represents the
(J+1)[K_v+1+J/2] inverse FFTs of FFT{c_k[n]}·FFT{c_l*[n]}. The
time to perform the complex 512 FFTs can be, for example,
10.94 µs, whereas the time to perform the complex vector
multiply and the complex 512 FFT can be, for example,
24.27/2 + 10.94 = 23.08 µs.

[0410] In order to provide illustrative examples of
processing times, two cases of interest are considered below. In
the first case scenario, a voice user is added to the system
while K = 100 users (K_v = 200 virtual users) are accessing the
system. Not all of these users are active. The control
channels are always active, but the data channels have
activity factor AF = 0.4. The mean number of active virtual
users is then K + AF·K = 140. The standard deviation is
σ = sqrt(K·AF·(1−AF)) = 4.90. Accordingly, there are
K_v ≤ 140 + 3σ < 155 active users with a high probability.

[0411] The second case, which represents a more demanding
scenario, arises when a single 384 Kbps data user is added
while a number of users are accessing the system. A single
384 Kbps data user adds interference equal to
(0.25+0.125·100)/(0.25+0.400·1)≈20 voice users. Hence, the
number of voice users accessing the system must be reduced to
approximately K=100-20=80 (K_v=160). The 3σ number of active
virtual users is then 80+(0.125)80+3(3.0)=99 active virtual
users. The reason this scenario is more demanding is that
when a single 384 Kbps data user is added to the system,
J+1=64+1=65 virtual users are added to the system.

[0412] In the first scenario, in which there are K_v=200
virtual users accessing the system and a voice user is added
to the system (J=1), the total time to add the voice user can
be (1+1)(10.94 μs)+(1+1)[200+1+1/2](23.08 μs)≈9.3 ms.

[0413] For the second scenario, in which there are K_v=160
virtual users accessing the system and a 384 Kbps data user
is added (J=64), the total time to add the 384 Kbps user can
be (64+1)(10.94 μs)+(64+1)[160+1+64/2](23.08 μs)=290 ms,
which is significantly larger than 9.3 ms. Hence, at least
for high data-rate users, the Γ-matrix elements are
calculated via convolutions.
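The two FFT-based totals above follow directly from the element count (J+1)[K_v+1+J/2]. The following C fragment is a minimal sketch of that arithmetic; the function and macro names are illustrative only and do not appear in the listing described elsewhere herein, and the two timing constants are simply the exemplary figures quoted above.

```c
#include <assert.h>

/* Exemplary timings quoted in the text, in microseconds. */
#define T_FFT_US       10.94  /* one 512-point complex FFT               */
#define T_FFT_CONV_US  23.08  /* complex vector multiply plus one FFT    */

/* Number of Gamma-matrix elements added when 1+J virtual users join
 * Kv existing virtual users: (J+1)[Kv + 1 + J/2]. */
double gamma_elements_added(int J, int Kv)
{
    assert(J >= 0 && Kv >= 0);
    return (J + 1) * (Kv + 1 + J / 2.0);
}

/* Estimated time, in microseconds, to compute those elements via FFTs:
 * (J+1) forward FFTs plus one multiply-and-inverse-FFT per element. */
double gamma_fft_time_us(int J, int Kv)
{
    return (J + 1) * T_FFT_US
         + gamma_elements_added(J, Kv) * T_FFT_CONV_US;
}
```

With J=1 and K_v=200 the sketch evaluates to roughly 9.3 ms, and with J=64 and K_v=160 to roughly 290 ms, in line with the figures given above.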
[0414] In the direct method of calculating the Γ-matrix
elements, the SAL zconvx function is utilized to perform the
following convolution:



1 T? 



(28) 



cn*i + JkN k +m}'C k [rn-j k N k ] 



[0415] For each value of m, N_min=min{N_l, N_k} complex
MACs (cmacs) need to be performed. Each cmac requires 8
flops, and there are m_total=N_l+N_k-1 m-values to calculate.
Hence, the total number of flops is 8N_min(N_l+N_k-1). In the
following, it is assumed that the convolution calculation is
performed at 1.50 GOPS=1500 ops/μs. The time required to
perform the convolutions, in μs, is presented in the table
below:



N_l \ N_k     256      128       64       32       16        8        4
256        697.69   261.46   108.89    48.98    23.13    11.22     5.53
128        261.46   174.08    65.19    27.14    12.20     5.76     2.79
64         108.89    65.19    43.35    16.21     6.74     3.03     1.43
32          48.98    27.14    16.21    10.75     4.01     1.66     0.75
16          23.13    12.20     6.74     4.01     2.65     0.98     0.41
8           11.22     5.76     3.03     1.66     0.98     0.64     0.23
4            5.53     2.79     1.43     0.75     0.41     0.23     0.15
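Each entry of the convolution-time table can be reproduced from the flop count 8N_min(N_l+N_k-1) and the assumed rate of 1500 ops/μs. A minimal C sketch of that calculation follows; the function name is hypothetical, and the rate is the assumption stated in the text rather than a measured value.

```c
#include <assert.h>

#define OPS_PER_US 1500.0  /* 1.50 GOPS, as assumed in the text */

/* Time, in microseconds, for a direct complex convolution of two codes
 * of lengths Nl and Nk: min(Nl,Nk) cmacs of 8 flops each for every one
 * of the Nl+Nk-1 output shifts. */
double conv_time_us(int Nl, int Nk)
{
    assert(Nl > 0 && Nk > 0);
    int n_min = (Nl < Nk) ? Nl : Nk;
    double flops = 8.0 * n_min * (Nl + Nk - 1);
    return flops / OPS_PER_US;
}
```

For example, conv_time_us(256, 256) evaluates to approximately 697.69 μs and conv_time_us(256, 4) to approximately 5.53 μs, matching the corresponding table entries.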



[0416] The total time to calculate the Γ-matrix is then
given by the following relation:

\[ T_\Gamma(K) = \tfrac{1}{2}\left[ K \cdot \mathrm{diag}(T) + K^{T} \cdot T \cdot K \right] \tag{29} \]

[0417] where T_qq' are the elements in the above Table 5.
Now suppose K'=K+Δ, where Δ_q=J_x δ_qx+J_y δ_qy, and where x
and y are not equal. Then

\[ \Delta T_\Gamma \equiv T_\Gamma(K') - T_\Gamma(K) = \tfrac{1}{2} J_x (J_x + 1) T_{xx} + \tfrac{1}{2} J_y (J_y + 1) T_{yy} + J_x J_y T_{xy} + \sum_{q=0}^{6} K_q \left( J_x T_{qx} + J_y T_{qy} \right) \tag{30} \]

[0418] In the first scenario, there are K_v=200 virtual users
accessing the system and a voice user is added to the system
(J=1). Hence, K_q=K_v δ_q0 (SF=256), K_v=200, J_x=2 and
J_y=0. The total time is then

(0.5)(2)(3)(0.70 ms)+(2)(200)(0.70 ms)≈283 ms

[0419] This number is large enough to require that, for
voice users at least, the Γ-matrix elements be calculated via
FFTs.

[0420] For the second scenario, there are K_v=160 virtual
users accessing the system and a 384 Kbps data user is added
to the system (J=64). Hence, K_q=K_v δ_q0 (SF=256), K_v=160,
J_x=1 (control) and J_y=J=64 (data). The total time is then

(K_v+1)T_00+J(K_v+1)T_06+(J+1)(J/2)T_66
=(161)(697.7 μs)+(64)(161)(5.53 μs)+(65)(32)(0.15 μs)
=112.33 ms+56.98 ms+0.31 ms=169.62 ms

[0421] Accordingly, these calculations should also be
performed by utilizing FFTs, which can require, for example,
23.08 μs per convolution. In addition, 1 FFT is required to
compute FFT{c_k*[n]} for the single control channel. This
can require an additional 10.94 μs. The total time, then, to
add the 384 Kbps user is

10.94 μs+(161)(23.08 μs)+(64)(161)(5.53 μs)
+(65)(32)(0.15 μs)=61.02 ms

Γ-matrix Elements to SDRAM

[0422] With reference to above Equation (27), the size of
the Γ-matrix in bytes is given by the following relation:

\[ M_b(K) = \tfrac{1}{2}\left[ K \cdot \mathrm{diag}(M) + K^{T} \cdot M \cdot K \right] \tag{31} \]

[0423] Now suppose K'=K+Δ, where Δ_q=J_x δ_qx+J_y δ_qy, and
where x and y are not equal. Then

\[ \Delta M_b \equiv M_b(K') - M_b(K) = \tfrac{1}{2} J_x (J_x + 1) M_{xx} + \tfrac{1}{2} J_y (J_y + 1) M_{yy} + J_x J_y M_{xy} + \sum_{q=0}^{6} K_q \left( J_x M_{qx} + J_y M_{qy} \right) \tag{32} \]



[0424] Consider a first exemplary scenario in which
K_q=200δ_q0 (SF=256) and a single voice user is added to the
system: J_x=2 (data plus control), and J_y=0. The total number
of bytes to be written to SDRAM is then 0.5(2)(3)(1022)+
200(2)(1022)=0.412 MB. Assuming an SDRAM write speed
of 133 MHz*8 bytes*0.5=532 MB/s, the time required to
write the Γ-matrix to SDRAM is then 0.774 ms.
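The byte count in this first scenario is an instance of the increment relation of Equation (32). The C fragment below sketches that relation for an arbitrary per-pair cost table together with the SDRAM write-time estimate. All names are hypothetical. The per-pair byte cost 2(N_q+N_q'-1) is an assumption, chosen because it reproduces the 1022-, 518- and 14-byte figures used in these examples; it is not stated explicitly in the text.

```c
#include <assert.h>

#define NUM_SF 7  /* classes q = 0..6, i.e. SF = 256, 128, ..., 4 */

/* ASSUMPTION: bytes stored per (q, q') pair = 2 * (N_q + N_q' - 1),
 * with N_q = 256 >> q chips. */
void fill_byte_costs(double *cost)
{
    for (int q = 0; q < NUM_SF; q++)
        for (int r = 0; r < NUM_SF; r++)
            cost[q * NUM_SF + r] = 2.0 * ((256 >> q) + (256 >> r) - 1);
}

/* Increment of a pairwise cost when Jx users of class x and Jy users of
 * class y (x != y) are added to a population K[q]. cost is a row-major
 * NUM_SF x NUM_SF table (bytes, or microseconds for timing). */
double pair_cost_increment(const double *cost, const int *K,
                           int x, int Jx, int y, int Jy)
{
    assert(x >= 0 && x < NUM_SF && y >= 0 && y < NUM_SF && x != y);
    double d = 0.5 * Jx * (Jx + 1) * cost[x * NUM_SF + x]
             + 0.5 * Jy * (Jy + 1) * cost[y * NUM_SF + y]
             + (double)Jx * Jy * cost[x * NUM_SF + y];
    for (int q = 0; q < NUM_SF; q++)
        d += K[q] * (Jx * cost[q * NUM_SF + x] + Jy * cost[q * NUM_SF + y]);
    return d;
}

/* Write time in milliseconds at 133 MHz * 8 bytes * 0.5 = 532 MB/s. */
double sdram_write_ms(double bytes)
{
    const double bytes_per_ms = 133.0e6 * 8.0 * 0.5 / 1000.0;
    return bytes / bytes_per_ms;
}
```

With K_0=200, x=0 and J_x=2 (and J_y=0), the sketch yields 411,866 bytes and a write time of about 0.774 ms, as in the scenario above. The same function can be applied to a timing table to evaluate Equation (30).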

[0425] For additional illustration of the time required for
storing the Γ-matrix, consider a second scenario in which
K_q=160δ_q0 (SF=256), and a single 384 Kbps (SF=4) user is
added to the system: J_x=1 (control) and J_y=64 (data). The
total number of bytes is then
0.5(1)(2)(1022)+0.5(64)(65)(14)+160{1(1022)+64(518)}=
5.498 MB. The SDRAM write speed is 133 MHz*8
bytes*0.5=532 MB/s. The time to write to SDRAM is then
10.33 ms.

Packing the Gamma-Matrix Elements in SDRAM

[0426] In this exemplary embodiment, the maximum total
size of the Γ-matrix is 20.5 MB. If it is assumed that in order
to pack the matrix, every element must be moved (this is the
most demanding scenario), then for an SDRAM speed of
133 MHz*8 bytes*0.5=532 MB/s, the move time is
2(20.5 MB)/(532 MB/s)=77.1 ms. If the Γ-matrix is divided
over three processors, this time is reduced by a factor of 3.
The packing can be done incrementally, so there is no strict
time limit.

Extracting Gamma-Matrix Elements from SDRAM 

[0427] As described above, in this exemplary embodiment,
the Γ-matrix data is retrieved by utilizing the following
algorithm:



- G_Jnfo(lIk].m_jnin2 

- G_Jnfo[lIk}m_max2 

m -m'*N-L b /(2N c ) 
for m' » 0:1 

for q - 0:L -1 

for q' - 0:L -1 

i - m'T + -q q - Xfa,. 
IP mte i - Nl - + iifcq- 

^mi-x - milI [ m iD«xl y m m*>d\ 
if °W >= n\nin 

"apan - m mix ~ t^rnta + 1 

suml - 0.0; 

ptrl - &G_info[lIk].Gll4m 1Di J 
ptr2-&g[m min *N c +x] 
while m^^j > 0 

suml += (*ptrl++) * (*ptr2++) 

m ipao 

end 

Cfm'IIIklqlq*] - suml 

end 

end 

end 

end 



[0428] The time requirements for calculating the Γ-matrix
elements in this exemplary embodiment, when a new user is
added to the system, were discussed above. The time
requirements for extracting the corresponding C-matrix
elements are discussed below.

[0429] The Γ_lk[m] elements are accessed from SDRAM. It
is highly likely that these values will not be contained in
either L1 or L2 cache. For a given (l, k) pair, however, the
spread in τ is likely to be, for most cases, less than 8 μs
(i.e., for a 4 μs delay spread), which equates to
(8 μs)(4 chips/μs)(2 bytes/chip)=64 bytes, or 2 cache lines.
In an embodiment in which data is read in for two values of
m', a total of 4 cache lines must be read. This will require
16 clocks, or about 16/133=0.12 μs. However, in some
embodiments, accesses to SDRAM may be performed at about 50%
efficiency so that the required time is about 0.24 μs.

[0430] If a user l=x is added to the system, the elements
C[m'][x][k][q][q'] for all m', k, q and q' need to be fetched.
As indicated above, all the m', q and q' values are typically
contained in 4 cache lines. Hence, if there are K_v virtual
users, 4K_v cache lines need to be read, thereby requiring
32K_v clocks, where the number of clocks has been doubled
to account for the 50% efficiency in accessing the SDRAM.
In general, addition of J+1 virtual users to the system at a
time requires 32K_v(J+1) clocks.

[0431] In one example where there are 155 active virtual
users and a new voice user is added to the system, the time
required to read in the C-matrix elements can be
32(155)(1+1) clocks/(133 clocks/μs)=74.6 μs. The present
industry standard hold time t_h for a voice call is 140 s. The
average rate λ of users added to the system can be determined
from λ·t_h=K, where K is the average number of users
utilizing the system. For K=100 users, λ=100/140 s, that is,
1 user is added per 1.4 s.

[0432] In another example where there are 99 active
virtual users and a 384 Kbps user is added to the system, the
time required to read in the C-matrix elements can be
32(99)(64+1) clocks/(133 clocks/μs)=1.55 ms. However,
data users presumably will be added to the system less
frequently than voice users.

Time to Extract Elements When a Lag Changes

[0433] Now suppose, for example, that for user l=x, lag q=y
changes. This necessitates fetching the elements
C[m'][x][k][y][q'] for all m', k and q'. All the q' values
will typically be contained in 1 cache line. Hence,
2(K_v)(1)=2K_v cache lines need to be read in, thereby
requiring 16K_v clocks, where the number of clocks has been
doubled to account for the 50% efficiency in accessing the
SDRAM. In general, when a time lag changes, there are J+1
virtual users for which the C-matrix elements need to be
updated. Such updating of the C-matrix elements can require
16K_v(J+1) clocks.

[0434] In one example in which 155 active virtual users
are present and a voice user's profile (one lag) changes, the
time required to read in the C-matrix elements can be
16(155)(1+1) clocks/(133 clocks/μs)=37.3 μs. As discussed
above, for high mobility users, such changes should occur at
a rate of about 1 per 100 ms per physical user. This equates
to about once per 1.33 ms processing interval, if there are
100 physical users. Hence, approximately 37.3 μs will be
required every 1.33 ms.

[0435] In another example where there are 99 virtual users
and a 384 Kbps data user's profile (one lag) changes, the
time required to read in the C-matrix elements can be
16(99)(64+1) clocks/(133 clocks/μs)=0.774 ms. However,
data users will have lower mobility and hence such changes
should occur infrequently.
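The four read-time examples above share one form: a number of cache lines per existing virtual user, four clocks per cache line, a factor of two for 50% SDRAM efficiency, and a 133 MHz clock. The following C sketch captures that estimate; the function name and its parameterization by cache lines per user are illustrative assumptions.

```c
#include <assert.h>

#define SDRAM_CLOCKS_PER_US 133.0

/* Estimated time, in microseconds, to read C-matrix inputs from SDRAM.
 * cache_lines_per_user is 4 when a user is added (two m' values, two
 * lines each) and 2 when a single lag changes. Each cache line costs
 * 4 clocks, doubled to account for 50% access efficiency. */
double c_matrix_read_us(int Kv, int J, int cache_lines_per_user)
{
    assert(Kv >= 0 && J >= 0 && cache_lines_per_user > 0);
    double clocks = 8.0 * cache_lines_per_user * Kv * (J + 1);
    return clocks / SDRAM_CLOCKS_PER_US;
}
```

For instance, c_matrix_read_us(155, 1, 4) gives about 74.6 μs and c_matrix_read_us(99, 64, 2) about 774 μs, consistent with the examples above.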

Writing C-Matrix Elements to L2 Cache

[0436] Consider again the case where user l=x is added to
the system. In such a case, the elements C[m'][x][k][q][q']
for all m', k, q and q' need to be written to cache. If there
are K_v active virtual users, 4K_vL² bytes need to be written,
where the number of bytes has been doubled because the
elements are complex. In general, addition of J+1 virtual
users to the system at a time will require 4K_vL²(J+1) bytes
to be written to L2 cache.

[0437] In one example, there are 155 active virtual users
and a new voice user is added to the system. In this case, the
time required to write the C-matrix elements can be
4(155)(16)(1+1) bytes/(2128 bytes/μs)=9.3 μs.

[0438] In another example, there are 99 active virtual users
and a 384 Kbps user is added to the system. In such a case,
the time required to write the C-matrix elements can be
4(99)(16)(64+1) bytes/(2128 bytes/μs)=193.5 μs. Data users
are typically added to the system less frequently than
voice users.
Time to Write Elements When a Lag Changes

[0439] Consider a situation in which, for user l=x, lag q=y
changes. In such a case, the elements C[m'][x][k][y][q'] for
all m', k and q' need to be written. If there are K_v active
virtual users, 4K_vL bytes need to be written, where the
number of bytes has been doubled since the elements are
complex. In general, when a lag changes for J+1 virtual users
at a time, 4K_vL(J+1) bytes need to be written to L2
cache.

[0440] In one example, there are 155 active virtual users
and a voice user's profile (one lag) changes. In such a case,
the time required to write the C-matrix elements will be
4(155)(4)(1+1) bytes/(2128 bytes/μs)=2.33 μs.

[0441] In a second case, there are 99 active virtual users
and a 384 Kbps data user's profile (one lag) changes. Then,
the time required to write the C-matrix elements will be
4(99)(4)(64+1) bytes/(2128 bytes/μs)=48.4 μs. However, data
users will have lower mobility and hence such changes
typically occur infrequently.

Packing C-matrix Elements In L2 Cache

[0442] In this exemplary embodiment, the C-matrix elements
are packed in memory every time a new user is added to or
deleted from the system, and every time a new user becomes
active or inactive. In this embodiment, the size of the
C-matrix is 2(3/2)(K_vL)²=3(K_vL)² bytes. If three processors
are utilized, the size per processor is (K_vL)² bytes.
Hence, the total time required for moving the entire matrix
within L2 cache is 2(K_vL)² bytes/(2128 bytes/μs), where the
factor of 2 accounts for read and write. By way of example,
if there are 155 active virtual users, the time required to
move the C-matrix elements is 2(155*4)² bytes/(2128
bytes/μs)=0.361 ms, whereas if there are 99 active virtual
users the time required to move the C-matrix elements is
2(99*4)² bytes/(2128 bytes/μs)=0.147 ms.
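The L2 write and packing estimates in the preceding paragraphs can be sketched in C as follows. The function names are hypothetical; the 2128 bytes/μs rate and the value L=4 lags per user are taken from the worked examples, and the lag_pairs parameter (L² on user addition, L on a lag change) is an illustrative way of covering both cases.

```c
#include <assert.h>

#define L2_BYTES_PER_US 2128.0  /* L2 transfer rate used in the text */

/* Time, in microseconds, to write updated C-matrix elements to L2.
 * lag_pairs is L*L when a user is added and L when one lag changes.
 * The factor 4 follows the text's byte count for complex elements. */
double c_matrix_write_us(int Kv, int J, int lag_pairs)
{
    assert(Kv >= 0 && J >= 0 && lag_pairs > 0);
    double bytes = 4.0 * Kv * lag_pairs * (J + 1);
    return bytes / L2_BYTES_PER_US;
}

/* Time, in microseconds, to repack one processor's (Kv*L)^2-byte share
 * of the C-matrix; the factor 2 covers the read and the write. */
double c_matrix_pack_us(int Kv, int L)
{
    assert(Kv >= 0 && L > 0);
    double n = (double)Kv * L;
    return 2.0 * n * n / L2_BYTES_PER_US;
}
```

For example, c_matrix_write_us(155, 1, 16) is about 9.3 μs and c_matrix_pack_us(155, 4) is about 361 μs, in agreement with the figures above.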

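Hardware Calculation Of Γ-matrix Elements

[0443] As discussed above, the C-matrix elements can be
represented in terms of the underlying code correlations in
accord with the following relation:

\[
\begin{aligned}
C_{lkqq'}[m'] &\equiv \frac{1}{2N_l} \sum_n s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_l^*[n] \\
&= \frac{1}{2N_l} \sum_n \sum_p g[(n-p)N_c + \tau] \cdot c_k[p] \cdot c_l^*[n] \\
&= \frac{1}{2N_l} \sum_m g[mN_c + \tau] \sum_n c_k[n-m] \cdot c_l^*[n] \\
&= \sum_m g[mN_c + \tau] \cdot \Gamma_{lk}[m], \\
\Gamma_{lk}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^*[n] \cdot c_k[n-m], \qquad \tau \equiv m'T + \tau_{lq} - \tau_{kq'}
\end{aligned}
\tag{33}
\]

[0444] The Γ-matrix represents the correlation between
the complex user codes. The complex code for user l is
assumed to be infinite in length, but with only N_l non-zero
values. The non-zero values are constrained to be ±1±j. The
Γ-matrix can be represented in terms of the real and
imaginary parts of the complex user codes as follows:

\[
\begin{aligned}
\Gamma_{lk}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^*[n] \cdot c_k[n-m] \\
&= \frac{1}{2N_l} \sum_n \{ c_l^R[n] - j c_l^I[n] \} \cdot \{ c_k^R[n-m] + j c_k^I[n-m] \} \\
&= \frac{1}{2N_l} \sum_n \{ c_l^R[n] c_k^R[n-m] + c_l^I[n] c_k^I[n-m] + j c_l^R[n] c_k^I[n-m] - j c_l^I[n] c_k^R[n-m] \} \\
&= \Gamma_{lk}^{RR}[m] + \Gamma_{lk}^{II}[m] + j \{ \Gamma_{lk}^{RI}[m] - \Gamma_{lk}^{IR}[m] \}
\end{aligned}
\tag{34}
\]

[0445] where

\[
\begin{aligned}
\Gamma_{lk}^{RR}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^R[n] \cdot c_k^R[n-m], &
\Gamma_{lk}^{II}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^I[n] \cdot c_k^I[n-m], \\
\Gamma_{lk}^{RI}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^R[n] \cdot c_k^I[n-m], &
\Gamma_{lk}^{IR}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^I[n] \cdot c_k^R[n-m]
\end{aligned}
\tag{35}
\]

[0446] Consider any one of the above real correlations,
denoted

\[ \Gamma_{lk}^{XY}[m] \equiv \frac{1}{2N_l} \sum_n c_l^X[n] \cdot c_k^Y[n-m] \tag{36} \]

[0447] where X and Y can be either R or I. Since the
elements of the codes are now constrained to be ±1 or 0, the
following relation can be defined:

\[ c_l^X[n] \equiv (1 - 2\gamma_l^X[n]) \cdot m_l^X[n] \tag{37} \]

[0448] where γ_l^X[n] and m_l^X[n] are both either zero or one.
The sequence m_l^X[n] is a mask used to account for values of
c_l^X[n] that are zero. With these definitions, the above
Equation (4) becomes

\[
\begin{aligned}
\Gamma_{lk}^{XY}[m] &= \frac{1}{2N_l} \sum_n (1 - 2\gamma_l^X[n]) \cdot (1 - 2\gamma_k^Y[n-m]) \cdot m_l^X[n] \cdot m_k^Y[n-m] \\
&= \frac{1}{2N_l} \sum_n \left[ 1 - 2(\gamma_l^X[n] \oplus \gamma_k^Y[n-m]) \right] \cdot m_l^X[n] \cdot m_k^Y[n-m] \\
&= \frac{1}{2N_l} \left\{ M_{lk}^{XY}[m] - 2 N_{lk}^{XY}[m] \right\}, \\
M_{lk}^{XY}[m] &\equiv \sum_n m_l^X[n] \cdot m_k^Y[n-m], \qquad
N_{lk}^{XY}[m] \equiv \sum_n (\gamma_l^X[n] \oplus \gamma_k^Y[n-m]) \cdot m_l^X[n] \cdot m_k^Y[n-m]
\end{aligned}
\tag{38}
\]

[0449] where ⊕ indicates modulo-2 addition (or logical
XOR).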
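The mask-and-XOR form lends itself to bitwise evaluation: with one chip per bit, a count of the mask overlap and a count of the sign disagreements give the un-normalized correlation as their weighted difference. The C sketch below illustrates this for codes of up to 32 chips packed in a single word. The packing convention (bit n holds chip n, a sign bit of 1 meaning -1), the function names and the single-word length limit are all assumptions made for illustration; the 1/(2N_l) normalization is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count of a 32-bit word. */
static int popcount32(uint32_t v)
{
    int n = 0;
    while (v) {
        v &= v - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Un-normalized real correlation sum_n c_l[n] * c_k[n - m] for codes
 * with chips in {+1, -1, 0}. sign_* holds the sign bit of each chip
 * (0 -> +1, 1 -> -1); mask_* is 1 where the chip is non-zero. */
int masked_correlation(uint32_t sign_l, uint32_t mask_l,
                       uint32_t sign_k, uint32_t mask_k, unsigned m)
{
    assert(m < 32);
    uint32_t sk = sign_k << m;           /* align c_k[n - m] with c_l[n] */
    uint32_t mk = mask_k << m;
    uint32_t both = mask_l & mk;         /* chips where both are non-zero */
    int M = popcount32(both);                  /* overlap count           */
    int N = popcount32((sign_l ^ sk) & both);  /* sign disagreements      */
    return M - 2 * N;
}
```

As a check, for the four-chip code (+1, -1, +1, +1) correlated with itself, the sketch returns 4 at shift 0 and -1 at shift 1, which agrees with direct summation.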

[0450] In addition to configurations discussed elsewhere
herein, FIGS. 17, 18 and 19 illustrate exemplary hardware
configurations for computing the functions M_lk^XY[m] and
N_lk^XY[m] for calculating the Γ-matrix elements. Once the
functions M_lk^XY[m] and N_lk^XY[m] are obtained, the
remaining calculations for obtaining the Γ-matrix elements
can be performed in software or hardware. In this exemplary
embodiment, these remaining calculations are performed in
software. More particularly, FIG. 17 shows a register having
an initial configuration subsequent to loading code and
mask sequences. Further, FIG. 18 depicts a logic circuit for
performing the requisite boolean functions. FIG. 19 depicts
the configuration of the register after implementing a
number of shifts.

[0451] The four functions Γ_lk^XY[m] corresponding to X,
Y=R, I, which are components of Γ_lk[m], can be calculated in
parallel. For K_v=200 virtual users, and assuming that 10% of
all (l, k) pairs need to be calculated in 2 ms, then for
real-time operation, 0.10(200)²=4000 Γ_lk[m] elements (all
shifts) need to be computed in 2 ms, or about 2M elements
(all shifts) per second. For K_v=128 virtual users, the
requirement drops to 0.8192M elements (all shifts) per second.

[0452] In this embodiment, the Γ_lk[m] elements are
calculated for all 512 shifts. However, not all of these
shifts are needed. Thus, it is possible to reduce the number
of calculations per Γ_lk[m] element by calculating only those
shifts that are needed.

[0453] As described in more detail elsewhere herein, in
one hardware implementation of the invention, a single
processor is utilized for performing the C-matrix
calculations whereas a plurality of processors, for example,
three processors, are employed for the R-matrix calculations,
which are considerably more complex. In what follows, a
load balancing method is described that calculates optimum
R-matrix partitioning points in normalized virtual user space
to provide an equal, and hence balanced, computational load
per processor. More particularly, it is shown that a closed
form recursive solution exists that can be solved for an
arbitrary number of processors.

Balancing Computational Load Among Processors 
for Parallel Calculation of R-matrix 

[0454] As a result of the following symmetry condition,
only half of the R-matrix elements need to be explicitly
calculated:

\[ R_{lk}(m) = R_{kl}(-m) \tag{39} \]

[0455] In essence, only two matrices need to be calculated.
One of these matrices is a combination of R(1) and R(-1), and
the other is the R(0) matrix. In this case, the essential R(0)
matrix elements have a triangular structure. The numbers of
computations performed to generate the raw data for the
R(1)/R(-1) and R(0) matrices are combined and optimized
as a single number. This approach is adopted due to the reuse
of the X-matrix outer product values (see the above Equation
(8)) across the two R-matrices. Combining the X-matrix and
correlation values dominates the processor utilization since
these operations represent the bulk of the computations. In
this embodiment, these computations are employed as a cost
metric for determining the optimum loading of each processor.

[0456] The optimization problem can be formulated as an 
equal area problem, where the solution results in equal 
partition areas. Since the major dimensions of the R-matri- 
ces are given in terms of the number of active virtual users, 
the solution space for the optimization problems can be 
defined in terms of the number of virtual users per processor. 
It is clear to those skilled in the art that the solution can be 
applicable to an arbitrary number of virtual users by nor- 
malizing the solution space by the number of virtual users. 

[0457] With reference back to FIG. 10, the computations
of the R(1)/R(-1) matrix can be represented by a square
HJKM while the computations of the R(0) matrix can be
represented by a triangle ABC. From elementary geometry,
the area of a rectangle of length b and height h is given by:

\[ A_r = bh \tag{40} \]

[0458] and the area of a triangle with a base width b and
a height h is given by

\[ A_t = \tfrac{1}{2} bh \tag{41} \]

[0459] Accordingly, a combined area of a rectangle A_ri and
a triangle A_ti having a common height a_i is given by the
following relation:

\[ A_i = A_{ri} + A_{ti} = a_i + \tfrac{1}{2} a_i^2 \tag{42} \]

[0460] wherein A_i provides the area of a region below a
given partition line. For example, one such area is that
within the rectangle HQRM plus the region within triangle
AFG. The difference in the area of successive partition
regions is employed to form a cost function. More
particularly,

\[ B_i = A_i - A_{i-1} \tag{43} \]

[0461] For an optimum solution, the B_i's corresponding to
i=1, 2, . . . , N, where N is the number of processors
performing the calculations, are equal. Because the total
normalized load is equal to A_N, the load per processor is
equal to A_N/N. That is

\[ B_i = \frac{A_N}{N} = \frac{3}{2N} \tag{44} \]

[0462] for i=1, 2, . . . , N.

[0463] By combining the above equations for B_i, the
solution for a_i can be found by finding the roots of the
following equation:

\[ \tfrac{1}{2} a_i^2 + a_i - \tfrac{1}{2} a_{i-1}^2 - a_{i-1} - \frac{3}{2N} = 0 \tag{45} \]

[0464] Hence, the solution for a_i is given as follows:

\[ a_i = -1 \pm \sqrt{1 + a_{i-1}^2 + 2a_{i-1} + \frac{3}{N}} \tag{46} \]

[0465] The negative roots of the above solution for a_i are
discarded because the solution space falls in the range [0,1].
Although it appears that a solution for a_i requires first
obtaining the value of a_{i-1}, expanding the recursion
relations of the a_i and utilizing the fact that a_0 equals 0
results in the following solution for a_i that does not
require obtaining a_{i-1}:

\[ a_i = -1 + \sqrt{1 + \frac{3i}{N}} \tag{47} \]

[0466] The table below illustrates the normalized partition
values for two, three, and four processors. To calculate the
actual partition values, the number of active virtual users is
multiplied by the corresponding table entries. Since a
fraction of a user cannot be allocated, a ceiling operation
can be performed that biases the number of virtual users per
processor towards the processors whose loading function is
less sensitive to perturbations in the number of users.

Location   Two processors        Three processors    Four processors
a_1        -1+√(5/2) (0.5811)    -1+√2 (0.4142)      -1+√(7/4) (0.3229)
a_2                              -1+√3 (0.7321)      -1+√(5/2) (0.5811)
a_3                                                  -1+√(13/4) (0.8028)
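The partition points, including the ceiling operation, can be obtained without evaluating a square root by testing the equal-load condition directly. The sketch below assumes the normalized cumulative load a+a²/2 described above, with a=n/K_v, and finds the smallest integer n whose cumulative load reaches i/N of the total load 3/2. The function name and the integer formulation are illustrative assumptions, not the method of the listing described elsewhere herein.

```c
#include <assert.h>

/* Smallest number of virtual users n such that the cumulative load
 * a + a*a/2, with a = n / Kv, is at least (i / N) * (3 / 2).
 * Multiplying through by 2 * N * Kv * Kv gives the integer test
 *     N * (2 * n * Kv + n * n) >= 3 * i * Kv * Kv,
 * which is equivalent to taking the ceiling of Kv * a_i. */
int partition_point(int Kv, int N, int i)
{
    assert(Kv > 0 && N > 0 && i >= 0 && i <= N);
    long long target = 3LL * i * Kv * Kv;
    int n = 0;
    while ((long long)N * (2LL * n * Kv + (long long)n * n) < target)
        n++;
    return n;
}
```

For 155 active virtual users and three processors, the sketch places the partition points at 65, 114 and 155 users, corresponding to the normalized values 0.4142, 0.7321 and 1 after the ceiling operation.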



[0467] The above methods for calculating the R-matrix
elements can be implemented in hardware and/or software
as illustrated elsewhere herein. With reference to FIG. 20, in
one embodiment, the above calculations are performed by
utilizing a single card that is populated with four PowerPC
7410 processors. These processors employ the AltiVec
SIMD vector arithmetic logic unit, which includes 32 128-bit
vector registers. These registers can hold either four 32-bit
floats, or four 32-bit integers, or eight 16-bit shorts, or
sixteen 8-bit characters. Two vector SIMD operations can be
performed per clock. The clock rate utilized in this
embodiment is 400 MHz, although other clock rates can also be
employed. Each processor has 32 KB of L1 cache and 2 MB of
266 MHz L2 cache. Hence, the maximum theoretical performance
level of these processors is 3.2 GFLOPS, 6.4 GOPS
(16-bit), or 12.8 GOPS (8-bit). In this exemplary
embodiment, a combination of floating-point, 16-bit
fixed-point and 8-bit fixed-point calculations is utilized.

[0468] With continued reference to FIG. 20, the calculation
of the C-matrix elements is performed by a single
processor 220. In contrast, the calculation of the R-matrix
elements is divided among three processors 222, 224, and
226. Further, a RACE++ 266 MB/sec 8-port switched fabric
228 interconnects the processors. The high bandwidth of the
fabric allows transfer of large amounts of data with minimal
latency so as to provide efficient parallelism of the four
processors.

Vector Processor-Based R-Matrix Generation 

[0469] Vector processing is beneficially employed, in one
embodiment of the invention, to speed calculations performed
by the processor card of FIGS. 2 and 3. Specifically,
the AltiVec™ vector processing resources (and, more
particularly, instruction set) of the Motorola PowerPC 7400
processor used in node processors 228 are employed to
speed calculation of the R-matrix. These processors include
a single-instruction multiple-data (SIMD) vector arithmetic
logic unit which includes 32 128-bit input vector units.
These units can hold either four 32-bit integers, or eight
16-bit integers, or even sixteen 8-bit integers. The clock rate
utilized in this embodiment is 400 MHz, although other clock
rates can also be employed.

[0470] Of course, those skilled in the art will appreciate
that other vector processing resources can be used in
addition or instead. These can include SIMD coprocessors or
node processors based on other chip sets, to name a few.
Moreover, those skilled in the art will appreciate that, while
the discussion below focuses on use of vector processing to
speed calculation of the R-matrix, the techniques described
below can be applied to calculating other matrices of the
type described previously as well as, more generally, to other
calculations used for purposes of CDMA and other
communications signal processing.

[0471] In the illustrated embodiment, a mapping vector is 
utilized to create a mapping between each physical user and 
its associated (or "decomposed") virtual users. This vector is 
populated during the decomposition process which, itself, 
can be accomplished in a conventional manner known in the 
art. The vector is used, for example, during generation of the 
R-matrix as described below. 

[0472] As further evident in the discussion below, the 
X-matrix (see Equation (8)), is arranged such that a "strip- 
mining" method of the boundary elements can be performed 
to further increase speed and throughput. The elements of 
that matrix are arranged such that successive ones of them 
can be stripped to generate successive elements in the 
R-matrix. This permits indices to be incremented rather than 
calculated. The elements are, moreover, arranged in a buffer 
such that adjacent elements can be multiplied with adjacent 
element of the C-matrix, thereby, limiting the number of 
required indices to two within the iterative summation loops. 

[0473] In the discussion that follows, a node processor 228
operating as a vector processor is referred to as vector
processor 410. FIG. 21 is a block diagram depicting the
architecture and operation of one such node processor 228,
and its corresponding vector processor 410, used in an
embodiment of the invention to calculate the R-matrix 428
using integer representations of the C-matrix 424 and
waveform amplitudes 426. To facilitate a complete
understanding of the illustrated embodiments, only a sampling
of operands is illustrated, e.g., a few elements each of the
C-matrix 424, complex amplitudes 426 and R-matrix 428. In
actual operation of a system according to the invention, the
vector processor 410 can be used to process matrices
containing hundreds or thousands of elements.

[0474] As shown in the drawing, the illustrated node
processor 228 is configured via software instructions to
execute a floating point to integer transformation process
406 and an integer to floating point transformation process
412, as well as to serve as a vector processor 410. The
relationship and signalling between these modes is depicted
in the drawing.

[0475] By way of overview, and as discussed above, one 
or more code-division multiple access (CDMA) waveforms 
or signals transmitted, e.g., from a user cellular phone, 
modem or other CDMA signal source are decomposed into 
one or more virtual user waveforms. The virtual user is 
deemed to "transmit" a single bit per symbol period of that 
received CDMA waveform. In turn, each of the virtual user 
waveforms is processed according to the methods and 
systems described above. 

[0476] In some embodiments, waveform processing is
performed using floating-point math, e.g., for generating the
gamma-matrix, C-matrix, R-matrix, and so on, all in the
manner described above. However, in an embodiment of the
invention, e.g., reflected in FIG. 21, integer math is
performed on the vector processor 410, taking advantage of
block-floating point representation of the operands. This
speeds waveform processing, albeit at the cost of accuracy.
However, in the illustrated embodiment, a balance is
achieved through use of 16-bit block-floating point
representation, e.g., in lieu of conventional 32-bit
floating-point representations. Those skilled in the art will
appreciate that block-floating representations of other bit
widths could be used instead, depending on implementation
requirements.

[0477] Referring to FIG. 21, the C-matrix 424 is generated
by the node processor 228 as described above, and is
stored in memory accordingly in a floating-point
representation, e.g., C_0 401, C_1 402, and so on. Further, the
amplitudes 426 are stored in memory as floating-point
representations. Both sets of representations are transformed
into block-floating format via a transformation process 406,
which generates a common exponent 414 and a 16-bit
integer for each operand. The transformation process 406
stores two integers in each word, e.g., C_0 408a, C_1 408b, and
a_0 409a, a_1 409b, and the corresponding block exponent 414.
The transformation process 406 can be performed via a
special purpose function or through use of extensions to the C
programming language, as can be seen in a programming
listing that is further described. The integers stored in
memory, e.g., 408, 409, are moved by the transformation
process 406 to the vector processor 410 for processing.
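By way of illustration only, a floating point to block-floating point conversion of the general kind performed by a process such as transformation process 406 might be sketched in C as follows. The function name, the choice of a power-of-two scale and the truncating conversion are assumptions made for this sketch; it is not the routine used in the listing described elsewhere herein.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n floats to 16-bit integers sharing one block exponent e,
 * such that in[i] is approximately out[i] / 2^e. The scale 2^e is the
 * largest power of two that keeps every value within the int16 range.
 * Returns e (0 if all inputs are zero). */
int to_block_float16(const float *in, int16_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = (in[i] < 0.0f) ? -in[i] : in[i];
        if (a > max_abs)
            max_abs = a;
    }

    int e = 0;
    float scale = 1.0f;
    if (max_abs > 0.0f) {
        /* Grow the scale while the largest value still fits. */
        while (max_abs * scale * 2.0f <= 32767.0f && e < 30) {
            scale *= 2.0f;
            e++;
        }
        /* Shrink the scale if the largest value overflows. */
        while (max_abs * scale > 32767.0f && e > -30) {
            scale *= 0.5f;
            e--;
        }
    }

    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)(in[i] * scale);  /* truncates toward zero */
    return e;
}
```

For the inputs 0.5, -0.25 and 0.125, the sketch selects a block exponent of 15 and produces the integers 16384, -8192 and 4096.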

[0478] The vector processor 410 includes two input vector
units 416, 418, an output vector unit 420, and an arithmetic
processor 422. Each vector unit is 128 bits in length; hence,
each can store eight of the 16-bit integer operands. The
arithmetic processor 422 has a plurality of operating
elements, 422a through 422c. Each of the operating elements
422a through 422c applies functionality to a set of operands
stored in the input vector units 416, 418, and stores the
processed data in the output vector unit 420. For example, the
operating element 422a performs functionality on operands
C_0 416a and a_0 418a and generates R_0 420a. The arithmetic
processor 422 can be programmed via C programming
instructions, or by a field programmable gate array or other
logic.

[0479] Although vector processor 410 includes two input
vector units 416, 418, in other embodiments it can have
numerous vector units that can be loaded with additional
C-matrix and complex amplitude representations at the same
time. Further, the operands can be stored in a non-sequential
order to accommodate increased throughput via storing
operands according to a first-used order.

[0480] As noted above, one way to program the arithmetic
processor 422 is through extensions to a high level
programming language. One such program, written in C, suitable
for instructing the arithmetic processor 422 to generate the
R-matrix is as follows:



#include "mudlib.h"
#define DO_CALC_STATS 0
#define DO_TRUNCATE 1
#define DO_SATURATE 1
#define DO_SQUELCH 0
#define SQUELCH_THRESH 1.0
#define TRUNCATE_BIAS 0.0
#if DO_TRUNCATE
#define SATURATE_THRESH (128.0 + TRUNCATE_BIAS)
#else
#define SATURATE_THRESH 127.5
#endif
#define SATURATE( f ) \
{ \
    if ( (f) >= SATURATE_THRESH ) f = (SATURATE_THRESH - 1.0); \
    else if ( (f) < -SATURATE_THRESH ) f = -SATURATE_THRESH; \
}
#if DO_TRUNCATE
#if 0
#define BF8_FIX( f ) ((BF8) ((FABS(f) <= TRUNCATE_BIAS) ? 0 : \
    (((f) > 0.0) ? ((f) - TRUNCATE_BIAS) : \
    ((f) + TRUNCATE_BIAS))))
#define BF8_FIX( f ) ((BF8) (f))
#else
#define BF8_FIX( f ) ((BF8) (((((f) < 0.0)) && ((f) = (float) ((int) (f)))) ? \
    ((f) + 1.0) : (f)))
#endif
#else
#define BF8_FIX( f ) ((BF8) (((f) >= 0.0) ? ((f) + 0.5) : ((f) - 0.5)))
#endif
#define UPDATE_MAX( f, max ) \
    if ( FABS( f ) > max ) max = FABS( f );
#define uchar unsigned char
#define ushort unsigned short
#define ulong unsigned long
#if DO_CALC_STATS
static float max_R_value;
#endif

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
);

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
);

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
);

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
);

void mudlib_gen_R (
    COMPLEX_BF16 *mpath1_bf,  /* ANTENNA DATA 1: TWO AMPLITUDE
                                 DATA VALUES a hat FOR EACH USER */
    COMPLEX_BF16 *mpath2_bf,  /* ANTENNA DATA 2 */
    COMPLEX_BF8 *corr_0_bf,   /* adjusted for starting physical user */
                              /* C MATRIX, I.E., C(0), SYMBOL YOU
                                 ARE ON VERSUS OTHER SYMBOLS */
    COMPLEX_BF8 *corr_1_bf,   /* adjusted for starting physical user */
                              /* C MATRIX. THIS IS A VIRTUAL
                                 USER BY VIRTUAL USER MATRIX.
                                 EACH USER HAS 16 VALUES THAT
                                 CORRELATE THAT USER TO OTHER
                                 USERS */
    uchar *ptov_map,          /* no more than 256 virts. per phys */
                              /* MAPPING OF PHYSICAL TO
                                 VIRTUAL USERS MAP. IN FURTHER
                                 EMBODIMENTS, THIS COULD
                                 DYNAMICALLY CHANGE AS USERS
                                 ENTER INTO AND LEAVE SYSTEM */
    float *bf_scalep,         /* scalar: always a power of 2 */
                              /* VECTOR WITH SCALAR FOR EACH
                                 VIRTUAL USER - NOTWITHSTANDING */
    float *inv_scalep,        /* start at 0'th physical user */
    float *scalep,            /* start at 0'th physical user */
    char *L1_cachep,          /* temp: 32K bytes, 32-byte aligned */
                              /* OUTPUTS (BEGINNING AT NEXT LINE) */
    BF8 *R0_upper_bf,         /* UPPER PART OF R(1) MATRIX - A
                                 TRIANGULAR PACKED MATRIX */
    BF8 *R0_lower_bf,         /* LOWER PART OF R(0) MATRIX */
    BF8 *R1_trans_bf,         /* TRANSPOSED FORM OF R(0) */
    BF8 *R1m_bf,              /* R(-1) --> 'm' STANDS FOR -1 */
    int tot_phys_users,       /* TOTAL PHYSICAL USERS */
    int tot_virt_users,       /* SUM OF VIRTUAL USERS */
    int start_phys_user,      /* zero-based starting row (inclusive) */
                              /* STARTING PHYSICAL USER TO WHICH
                                 THIS PROCESSOR IS ASSIGNED */
    int start_virt_user,      /* relative to start_phys_user */
                              /* STARTING VIRTUAL USER TO WHICH
                                 THIS PROCESSOR IS ASSIGNED */
                              /* NOTE: THIS IS AN ADVANTAGE
                                 IN ALLOWING US TO PARTITION A
                                 GIVEN PHYSICAL USER TO MULTIPLE
                                 PROCESSORS */
    int end_phys_user,        /* zero-based ending row (inclusive) */
                              /* SAME AS ABOVE, BUT END VALUES */
    int end_virt_user         /* relative to end_phys_user */
)



{
    COMPLEX_BF16 *X_bf;
    BF32 *R_sums0, *R_sums1;

    /* BEGINNING OF PARTITIONING AND PARAMETER SET-UP LOGIC */
    uchar *R0_ptov_map;
    int bump, byte_offset, i, iv, last_virt_user;
    int R0_align, R0_skipped_virt_users, R0_tcols, R0_virt_users, R1_tcols;

#if DO_CALC_STATS
    max_R_value = 0.0;
#endif

    X_bf = (COMPLEX_BF16 *)L1_cachep;
    byte_offset = tot_phys_users * NUM_FINGERS_SQUARED * sizeof(COMPLEX_BF16);
    R_sums0 = (BF32 *) (((ulong)X_bf + byte_offset + R_MATRIX_ALIGN_MASK) &
        ~R_MATRIX_ALIGN_MASK);
    byte_offset = tot_virt_users * sizeof(BF32);
    R_sums1 = (BF32 *) (((ulong)R_sums0 + byte_offset + R_MATRIX_ALIGN_MASK) &
        ~R_MATRIX_ALIGN_MASK);
    R0_ptov_map = (uchar *) (((ulong)R_sums1 + byte_offset + R_MATRIX_ALIGN_MASK) &
        ~R_MATRIX_ALIGN_MASK);
    R1_tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    R0_virt_users = 0;

    for ( i = start_phys_user; i < tot_phys_users; i++ ) {
        R0_virt_users += (int)ptov_map[i];
        R0_ptov_map[i] = ptov_map[i];
    }
    R0_ptov_map[start_phys_user] -= start_virt_user;
    R0_skipped_virt_users = tot_virt_users - R0_virt_users + start_virt_user;
    R0_virt_users -= (start_virt_user + 1);
    --inv_scalep; /* predecrement to allow for common indexing */

    for ( i = start_phys_user; i <= end_phys_user; i++ ) { /* LOOP OVER ALL
                                                              PHYSICAL USERS
                                                              (ASSIGNED TO THIS
                                                              PROCESSOR) */
        gen_X_row ( /* FIND C CODE THAT PERTAINS TO THIS */
            mpath1_bf,
            mpath2_bf,
            X_bf,
            i,
            tot_phys_users
        );

        --R0_ptov_map[i]; /* excludes R0 diagonal */
        last_virt_user = (i < end_phys_user) ? ((int)ptov_map[i] - 1) :
            end_virt_user;

        for ( iv = start_virt_user; (iv + 1) <= last_virt_user; iv += 2 ) {
            gen_R_sums2 (
                X_bf + (i * NUM_FINGERS_SQUARED),
                corr_0_bf,
                corr_0_bf + ((R0_virt_users - 1) * NUM_FINGERS_SQUARED),
                R0_ptov_map + i,
                R_sums0 + (R0_skipped_virt_users + 1),
                R_sums1 + (R0_skipped_virt_users + 1),
                tot_phys_users - i
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;
            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );
            R0_upper_bf[ R0_align - 1 ] = 0; /* zero diagonal element */
            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            R0_tcols = R1_tcols - ((R0_skipped_virt_users + 1) &
                ~R_MATRIX_ALIGN_MASK);
            R0_align = ((R0_skipped_virt_users + 1) & R_MATRIX_ALIGN_MASK) + 1;
            gen_R_matrices (
                R_sums1 + (R0_skipped_virt_users + 2),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep + (R0_skipped_virt_users + 2),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users - 1
            );
            R0_upper_bf[ R0_align - 1 ] = 0; /* zero diagonal element */
            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums2 (
                X_bf,
                corr_1_bf,
                corr_1_bf + (tot_virt_users * NUM_FINGERS_SQUARED),
                ptov_map,
                R_sums0,
                R_sums1,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );
            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;
            gen_R_matrices (
                R_sums1,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );
            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (((2 * R0_virt_users) - 1) * NUM_FINGERS_SQUARED);
            corr_1_bf += ((2 * tot_virt_users) * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 2;
            R0_virt_users -= 2;
            R0_skipped_virt_users += 2;
        }

        if ( iv <= last_virt_user ) {
            bump = R0_ptov_map[ i ] ? 0 : 1;
            gen_R_sums (
                X_bf + ((i + bump) * NUM_FINGERS_SQUARED),
                corr_0_bf,
                R0_ptov_map + i + bump,
                R_sums0 + (R0_skipped_virt_users + 1),
                tot_phys_users - i - bump
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;
            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );
            R0_upper_bf[ R0_align - 1 ] = 0; /* zero diagonal element */
            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums (
                X_bf,
                corr_1_bf,
                ptov_map,
                R_sums0,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );
            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (R0_virt_users * NUM_FINGERS_SQUARED);
            corr_1_bf += (tot_virt_users * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 1;
            R0_virt_users -= 1;
            R0_skipped_virt_users += 1;
        }

        start_virt_user = 0; /* for all subsequent passes */
    }

#if DO_CALC_STATS
    printf( "max_R_value = %f\n", max_R_value );
    if ( max_R_value > 127.0 )
        printf( "***** OVERFLOW *****\n" );
#endif
}

#if COMPILE_C

/* OUTPUT PRODUCT OF TWO ANTENNAS */
void gen_X_row ( /* EACH ANTENNA HAS TWO VALUES PER PHYSICAL USER */
    COMPLEX_BF16 *mpath1_bf, /* 2ND ANTENNA IS DIVERSITY ANTENNA */
    COMPLEX_BF16 *mpath2_bf, /* RESULTING OUTPUT PRODUCT IS REP'D BY
                                X sub 1 */
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
)
{
    COMPLEX_BF16 *in_mpath1p, *in_mpath2p;
    COMPLEX_BF16 *out_mpath1p, *out_mpath2p;
    int i, j, q, q1;
    BF32 s1r, s1i, s2r, s2i;
    BF32 a1r, a1i, a2r, a2i;
    BF32 cr, ci;

    out_mpath1p = mpath1_bf + (phys_index * NUM_FINGERS);
    out_mpath2p = mpath2_bf + (phys_index * NUM_FINGERS);

    for ( i = 0; i < tot_phys_users; i++ ) {
        in_mpath1p = mpath1_bf + (i * NUM_FINGERS); /* 4 complex values */
        in_mpath2p = mpath2_bf + (i * NUM_FINGERS); /* 4 complex values */
        j = 0;

        for ( q1 = 0; q1 < NUM_FINGERS; q1++ ) {
            s1r = (BF32) out_mpath1p[q1].real;
            s1i = (BF32) out_mpath1p[q1].imag;
            s2r = (BF32) out_mpath2p[q1].real;
            s2i = (BF32) out_mpath2p[q1].imag;
            for ( q = 0; q < NUM_FINGERS; q++ ) {
                a1r = (BF32) in_mpath1p[q].real;
                a1i = (BF32) in_mpath1p[q].imag;
                a2r = (BF32) in_mpath2p[q].real;
                a2i = (BF32) in_mpath2p[q].imag;

                cr = (a1r * s1r) + (a1i * s1i); /* COMBO OF TWO ANTENNAS -
                                                   COULD BE MORE, OF COURSE */
                ci = (a1r * s1i) - (a1i * s1r); /* cr IS REAL PART OF
                                                   ELEMENT OF X-MATRIX */
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * NUM_FINGERS_SQUARED + j].real = (BF16) (cr >> 16); /* BLOCK ... */
                X_bf[i * NUM_FINGERS_SQUARED + j].imag = (BF16) (ci >> 16);
                j++;
            }
        }
    }
}

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
)
{
    int i, j, k;
    BF32 sum;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32) X_bf[k].real * (BF32) corr_bf->real;
                sum += (BF32) X_bf[k].imag * (BF32) corr_bf->imag;
                ++corr_bf;
            }
            *R_sums++ = sum;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
)
{
    int i, j, k;
    BF32 suma, sumb;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            suma = 0;
            sumb = 0;
            for ( k = 0; k < 16; k++ ) {
                suma += (BF32) X_bf[k].real * (BF32) corra_bf->real;
                suma += (BF32) X_bf[k].imag * (BF32) corra_bf->imag;
                sumb += (BF32) X_bf[k].real * (BF32) corrb_bf->real;
                sumb += (BF32) X_bf[k].imag * (BF32) corrb_bf->imag;
                ++corra_bf;
                ++corrb_bf;
            }
            *R_sumsa++ = suma;
            *R_sumsb++ = sumb;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
)
{
    int i;
    float bf_scale, fsum, fsum_scale, inv_scale, scale;

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {
        scale = scalep[i];
        fsum = (float) (R_sums[i]);
        fsum *= bf_scale;
        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;
#if DO_CALC_STATS
        UPDATE_MAX( fsum_scale, max_R_value )
        UPDATE_MAX( fsum, max_R_value )
#endif
#if DO_SQUELCH
        if ( FABS( fsum_scale ) <= SQUELCH_THRESH ) fsum_scale = 0.0;
        if ( FABS( fsum ) <= SQUELCH_THRESH ) fsum = 0.0;
#endif
#if DO_SATURATE
        SATURATE( fsum_scale )
        SATURATE( fsum )
#endif
        no_scale_row_bf[i] = BF8_FIX( fsum );
        scale_row_bf[i] = BF8_FIX( fsum_scale );
    }
}

#endif /* COMPILE_C */
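
As a minimal sketch of the saturate-and-truncate step that closes gen_R_matrices, the following C function clamps a scaled sum to the signed 8-bit range and converts it to an integer. The name saturate_to_bf8 and the use of int8_t in place of BF8 are assumptions made for illustration only. The threshold of 128.0 corresponds to SATURATE_THRESH when DO_TRUNCATE is 1 and TRUNCATE_BIAS is 0.0, and the BF8_FIX variants of the listing are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the listing): clamp a scaled sum to
 * the signed 8-bit range and truncate it toward zero. */
int8_t saturate_to_bf8(float f)
{
    const float thresh = 128.0f;   /* assumed: 128.0 + TRUNCATE_BIAS, bias 0.0 */

    assert(f == f);                /* NaN cannot be converted to an integer */
    if (f >= thresh)
        f = thresh - 1.0f;         /* clamp to the largest value, 127 */
    else if (f < -thresh)
        f = -thresh;               /* clamp to the smallest value, -128 */
    return (int8_t)f;              /* float-to-int conversion truncates toward zero */
}
```

With these assumptions, an input of 200.0 saturates to 127 and an input of -300.0 saturates to -128, while in-range inputs such as 3.7 are simply truncated to 3.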



[0481] A transformation process 412 transforms the 
block-floating representations stored in the output vector unit 420 
into floating point representations and stores those to a 
memory 428. Here, the R-matrix elements stored in the 
output vector unit 420 are transformed into floating-point 
representations, which can then be used in the manner 
described above for estimating symbols in the physical user 
waveforms. 
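
The transformation from block-floating to floating-point form can be illustrated by the following sketch. It assumes, for illustration only, that a block is held as signed 8-bit mantissas sharing a single power-of-two scale; bf8_block_to_float and its parameters are hypothetical names and do not appear in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand n block-floating mantissas into floats.
 * block_scale is the one power-of-two scale shared by the whole block
 * (an assumed representation of the block exponent). */
void bf8_block_to_float(const int8_t *in, size_t n,
                        float block_scale, float *out)
{
    size_t k;

    assert(in != NULL && out != NULL);
    for (k = 0; k < n; k++)
        out[k] = (float)in[k] * block_scale;   /* exact for power-of-two scales */
}
```

Because the scale is a power of two, the multiplication changes only the exponent of each value, so no mantissa precision is lost in the conversion (barring overflow or underflow).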

[0482] In summary, sufficient throughput can be achieved 
with necessary accuracy using a vector processor 410 apply- 
ing integer math on 16-bit block-floating integers. Of course, 
in other embodiments, different block-floating sizes can be 
used depending on such criteria as the number of users, 
speed of the processors, and necessary accuracy of the 
symbol estimates, to name a few. Further, methods and logic 
like those described can be used to generate other matrices (e.g., 
the gamma-matrix and the C-matrix) and to perform other 
calculations within the illustrated embodiment. 
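
By way of a hedged illustration of how a block of values might be placed into 16-bit block-floating form, the sketch below picks one power-of-two scale for the whole block and stores each value as a 16-bit integer mantissa. The function name block_quantize16, the normalization range, and the use of truncation are all assumptions chosen for the example, not details taken from the illustrated embodiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: quantize n finite floats to 16-bit mantissas that
 * share one power-of-two scale. Returns the scale s, so that in[k] is
 * approximately out[k] * s. Inputs must be finite. */
double block_quantize16(const float *in, size_t n, int16_t *out)
{
    double max_abs = 0.0;
    double s = 1.0;
    size_t k;

    assert(in != NULL && out != NULL);
    for (k = 0; k < n; k++) {
        double a = in[k] < 0 ? -(double)in[k] : (double)in[k];
        if (a > max_abs)
            max_abs = a;
    }
    /* Adjust s until the largest magnitude lands in [16384, 32768). */
    while (max_abs / s >= 32768.0)
        s *= 2.0;
    while (max_abs > 0.0 && max_abs / s < 16384.0)
        s *= 0.5;
    for (k = 0; k < n; k++)
        out[k] = (int16_t)((double)in[k] / s);   /* truncates toward zero */
    return s;
}
```

A smaller mantissa width would follow the same pattern with a narrower target range, trading accuracy for throughput as the paragraph above suggests.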

[0483] A further understanding of the operation of the 
illustrated and other embodiments of the invention may be 
attained by reference to (i) U.S. Provisional Application 
Ser. No. 60/275,846 filed Mar. 14, 2001, entitled 
"Improved Wireless Communications Systems and Methods"; 
(ii) U.S. Provisional Application Ser. No. 60/289,600 
filed May 7, 2001, entitled "Improved Wireless Communications 
Systems and Methods Using Long-Code Multi-User Detection"; 
and (iii) U.S. Provisional Application Ser. No. 60/295,060 
filed Jun. 1, 2001, entitled "Improved Wireless Communications 
Systems and Methods for a Communications Computer," the 
teachings of all of which are incorporated herein by reference, 
and a copy of the latter of which may be filed herewith. 

[0484] The above embodiments are presented for illustrative 
purposes only. Those skilled in the art will appreciate 
that various modifications can be made to these embodiments 
without departing from the scope of the present 
invention. For example, the processors could be of other makes 
and manufactures and/or the boards can be of other physical 
designs, layouts or architectures. Moreover, the FPGAs and 
other logic devices can be implemented in software, or vice 
versa. It will also be appreciated that while the illustrated 
embodiments decompose physical user waveforms into virtual user 
waveforms, the mechanisms described herein can be applied, as 
well, without such decomposition, and that, accordingly, the 
terms "waveform" or "user waveform" should be treated as 
referring to either physical or virtual waveforms unless 
otherwise evident from context. 



Therefore, in view of the foregoing, what we claim is: 

1. A communications device for detecting user transmitted 
symbols encoded in spread spectrum waveforms (hereinaf- 
ter "user waveforms") comprising 

a digital signal processor (hereinafter "DSP") that pro- 
cesses user waveforms to determine characteristics 
thereof, the DSP having an associated memory and an 
associated direct memory access (hereinafter "DMA") 
controller that controls access to that memory, 

a programmable logic device (hereinafter "PLD") that is 
coupled to the DMA controller and that configures it to 
move data relating to user waveform characteristics 
from the memory to a buffer external to the DSP. 

2. The device of claim 1, wherein the PLD configures the 
DMA controller to move the data from the memory to the 
buffer in blocks. 

3. The device of claim 2, wherein the PLD configures the 
DMA controller to move the data from the memory to the 
buffer in unfragmented blocks. 

4. The device of claim 2, wherein the PLD configures the 
DMA controller to move the data from the memory to the 
buffer in fragmented blocks. 

5. The device of claim 4, wherein the PLD formats the 
fragmented blocks in the buffer for subsequent defragmen- 
tation. 

6. A communications device for detecting user transmitted 
symbols encoded in spread spectrum waveforms (hereinaf- 
ter "user waveforms") comprising 

a first-in first-out buffer comprising a dual-port random 
access memory, 

a digital signal processor (hereinafter "DSP") that pro- 
cesses user waveforms to determine characteristics 
thereof, the DSP having an associated memory and an 
associated direct memory access (hereinafter "DMA") 
controller that controls access to that memory, 

a programmable logic device (hereinafter "PLD") that is 
coupled to the DMA controller and that configures it to 
move data relating to user waveform characteristics 
from the memory to the buffer external to the DSP. 

7. The device of claim 6, wherein the programmable logic 
device is any of a field programmable gate array and an 
application specific integrated circuit. 

8. The device according to claim 6, comprising a multi- 
port data switch coupled with the PLD. 

* * * * *